ABSTRACT



A parallel processing system is receptive of a program 
and has at least two processors connected in parallel to 
a shared main memory. Each processor executes in- 
structions of the program which includes side-effecting 
instructions which modify the contents of a location in 
main memory and functional instructions which refer- 
ence locations in main memory. The program is com- 
piled into a series of independent instruction blocks 
each of which includes predominantly functional in- 
structions and terminates in a side-effecting instruction 
and the processors execute the blocks in parallel. 

13 Claims, 4 Drawing Sheets 
SYSTEM AND METHOD FOR PARALLEL PROCESSING WITH MOSTLY
FUNCTIONAL LANGUAGES

BACKGROUND OF THE INVENTION

The present invention relates to a computer system and a
method for parallel processing program instructions.

Many parallel architectures start by defining the
programming model they support as strictly functional.
Computer architects find that the strictly functional
approach makes their task easier, because the order in
which computation is performed, once the basic
dependencies are satisfied, is irrelevant. Building highly
parallel machines which execute strictly functional
languages then becomes relatively straightforward.
Strictly functional languages certainly have advantages,
including the relative ease of proving their correctness
and the ease of thinking and reasoning about certain
simple kinds of programs, as is disclosed in Backus,
"Can Programming be Liberated from the von Neumann
Style? A Functional Style and its Algebra of Programs"
Communications of the ACM Vol. 21 no. 8 pp. 613-641
(August 1978).

Unfortunately there is a class of algorithms for which the
strictly functional programming style appears to be
inadequate. The problems are rooted in the difficulty in
interfacing a strictly functional program to a
non-functional outside environment. It is impossible to
program a task such as an operating system, for instance,
in the unmodified strictly functional style. Clever
modifications of the strict functional style have been
proposed which cure this defect. For example, McCarthy's
AMB operator, which returns whichever of two inputs is
ready first, appears to be adequate to solve this and
similar problems. Several equivalent operators have been
proposed in Peter Henderson, "Is It Reasonable to
Implement a Complete Programming System in a Purely
Functional Style?" Technical memo PMM/94, The University
of Newcastle upon Tyne Computing Laboratory (December
1980).

However, the introduction of such non-functional
operators into the programming language destroys many of
its advantages. Architects can no longer ignore the order
of execution of functions. Proving program correctness
again becomes a very much more difficult task. Users also
find that programming with such operators is tedious and
error prone. In particular, as shown by Agha in "Actors: A
Model of Concurrent Computation in Distributed Systems"
M.I.T. Artificial Intelligence Laboratory Technical Memo
844, (1985), the presence of such an operator allows the
definition of a new data type, the cell, which can be side
effected just as normal memory locations in non-functional
languages. Thus the introduction of the AMB operator,
while adequate to solve the lack of side effect problem,
re-introduces many of the problems which the functional
language advocates are attempting to avoid.

Therefore conventional thought indicated the choice of
either defining weak systems, such as purely functional
languages, about which one can prove theorems, or defining
strong systems which one can prove little about.

Much more serious than any of the theoretical difficulties
of strictly functional languages, is the difficulty
programmers face in implementing certain types of
programs. It is often more natural to think about an
algorithm in terms of the internal state of a set of
objects as, for example, discussed in chapter 3 of Abelson
and Sussman, Structure and Interpretation of Computer
Programs, M.I.T. Press, Cambridge, (1985). This object
oriented viewpoint in programming language design is
incompatible with the strictly functional approach to
computation. The same algorithms can be implemented in
either style, and both systems are surely universal, but
the fact that programmers find one representation for
programs easier to think about, easier to design, and
easier to debug is, in itself, a powerful motivation for a
programming language to provide that representation.

If this were merely a representation issue, there would be
hope that suitable compiler technology could eventually
lead to a programming language which provides support for
side effects, but which compiles into purely functional
operations. As has been discussed above, however, in the
general case this is not possible.

Side effects cause problems. Architects find it difficult
to build parallel architectures for supporting them.
Verification software finds such programs difficult to
reason about. And programming styles with an abundance of
side effects are difficult to understand, modify and
maintain.

In accordance with the present invention, instead of
completely eliminating side effects, it is proposed that
their use be severely curtailed. Most side effects in
conventional programming languages like Fortran are
gratuitous. They are not solving difficult multi-tasking
communications problems. Nor are they improving the large
scale modularity or clarity of the program. Instead, they
are used for trivial purposes such as updating a loop
index variable.

Elimination of these unnecessary side effects can make
code more readable, more maintainable, and as set forth
hereinafter in accordance with the invention, faster to
execute.

Many of these same issues motivate Halstead's Multi-Lisp,
a parallel variant of Scheme as disclosed in "MultiLisp: A
Language for Concurrent Symbolic Computation" ACM
Transactions on Programming Languages and Systems, (1985).
The approach Multi-Lisp takes to providing access to
parallel computation is the addition of programmer visible
primitives to the language. The three primitives which
distinguish it from conventional Lisp are:

The future primitive which allows an encapsulated value to
be evaluated, while simultaneously returning to its caller
a promise for that value. The caller can continue to
compute with this returned object, incorporating it into
data structures, and passing it to other functions. Only
when the value is examined is it necessary for the
parallel computation of its value to complete;

the pcall primitive which allows the parallel evaluation
of arguments to a function. Halstead teaches that pcall
can be implemented as a simple macro which expands into a
sequence of futures. It can be thought of as syntactic
sugar for a stylized use of futures; and

the delay primitive which allows a programmer to specify
that a particular computation be delayed until the result
is needed. Similar in some respects to a future, delay
returns a promise to compute the value, but does not begin
computation of a value until the result is needed. Thus
the delay primitive is not a source of parallelism in the
language. It is a way of providing lazy evaluation
semantics to the programmer.
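The behavior of the three primitives can be illustrated outside Lisp. The following is a minimal Python sketch, not Multi-Lisp itself: the standard library's `ThreadPoolExecutor.submit` stands in for future, while the `Delay` class, the `pcall` helper, and the `square` function are hypothetical names introduced only for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


class Delay:
    """Lazy promise: the thunk runs only when the value is first forced."""

    def __init__(self, thunk):
        self._thunk = thunk
        self._forced = False
        self._value = None

    def force(self):
        if not self._forced:
            self._value = self._thunk()
            self._forced = True
        return self._value


def pcall(pool, fn, *arg_thunks):
    """Evaluate each argument as a future, then apply fn to the results."""
    futures = [pool.submit(thunk) for thunk in arg_thunks]
    return fn(*(f.result() for f in futures))


evaluated = []  # records which inputs have actually been computed


def square(x):
    evaluated.append(x)
    return x * x


with ThreadPoolExecutor(max_workers=4) as pool:
    promise = pool.submit(square, 7)     # future: evaluation may begin at once
    lazy = Delay(lambda: square(9))      # delay: nothing is computed yet
    started_early = 9 in evaluated       # the delay has not been forced
    total = pcall(pool, lambda a, b: a + b,
                  lambda: square(2), lambda: square(3))
    future_value = promise.result()      # examining the value waits for it
    delay_value = lazy.force()           # first examination triggers the work
    delay_again = lazy.force()           # later examinations reuse the value
```

In this sketch only the future and pcall introduce parallelism; the delay merely postpones the work until `force` is called, which matches the distinction drawn above.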
Both delay and future result in an order of execution
different from applicative order computation. In the
absence of side effects, both will result in the same value
as the equivalent program without the primitives, since
they affect only the order in which the computation is
performed. This is not strictly true for delay since its
careful use can allow otherwise non-terminating programs
to return a value.

In the presence of side effects, it is difficult to predict
the behavior of a future, since its value may be computed
in parallel with other computations.

While the value of a delay is deterministic, since it does
not introduce additional parallelism, its time of
computation is dependent on when its value is first
examined. This can be very non-intuitive and difficult to
think about while writing programs.

Halstead implements the future and delay primitives by
returning the caller an object of data-type future. This
object contains a pointer to a slot, which will eventually
contain the value returned by the promised computation.
The future may be stored in data structures and passed as a
value. Computations which attempt to reference the value
of the future prior to its computation are suspended. When
the promised computation completes, the returned value is
stored in the specified slot, and the future becomes
determined. This allows any pending suspended procedures
waiting for this value to run, and any further references
to the future will simply return the now computed value.

The use of future or delay can be thought of as a
declaration. Their use declares that either the
computation done within the scope of the delay or future
has no side effects, or that, if it does, the order in
which those side effects are done relative to other
computations is irrelevant. There is also a guarantee with
such a declaration that no free or shared variable
referenced by the computation is side effected by some
other parallel-executing portion of the program.

Like all declarations, the use of future or delay is a very
strong assertion. In a way similar to type declarations,
their use is difficult to check automatically, is error
prone, has a significant performance impact if omitted,
and may function correctly for many test cases, but fail
unexpectedly on others.

Advocates of strong type checking in compiled languages
have attempted for years to build compilers capable of
proving the type correctness of programs. It is believed
that they have failed. All languages sophisticated enough
for serious programming require at least some dynamic
checks for type safety at execution time. These checks are
implemented as additional instructions on conventional
architectures, or as part of the normal instruction
execution sequence on more recent architectures as
discussed in Moon, "The Architecture of the Symbolics
3600" The 12th Annual International Symposium on Computer
Architecture pp. 76-83 (1985). One alternative, declaring
the types or relying on the word of the programmer, while
leading to good performance on conventional
architectures, is dangerous, error-prone, and
inappropriate for sophisticated modern programming
environments.

The main object of the present invention is to eliminate
the disadvantages of the prior art systems.

The present invention relates to a computer system having
the ability to execute several portions of a program in
parallel, while giving the appearance to the programmer
that the program is being executed sequentially. This
automatic extraction of parallelism makes the architecture
attractive because one can exploit the existing
programming tools, languages and algorithms which have
been developed over the past several decades of computer
science research, while gradually endeavoring to improve
them for the parallel environment. While not all of the
benefits of parallel execution will be attained with this
architecture, the ability to extract even modest amounts
of parallel execution from existing programs may be worth
the effort. Moreover, the architecture can be extended to
provide some explicit and modular programmer control over
the parallelism.

In achieving the invention, just as hardware has provided
important runtime support for type checking, it is
proposed to use hardware in the runtime checking for
future declarations.

In a proposed Lisp implementation, explicit language
extensions such as future and pcall are not used. Instead,
it is assumed that each evaluation is encapsulated in a
future and that each call can freely evaluate its
arguments in parallel. The approach is to detect and
correct those cases in which this assumption is
unjustified.

In accordance with the invention a compiler transforms the
program into sequences of compiled primitive instructions,
called transaction blocks. Each transaction block has a
mostly functional portion and terminates in a side effect.
While the mostly functional portion preferably has no side
effecting instructions, the system will operate
effectively as long as at least the majority of the
instructions are functional, that is, they only reference
but do not modify main memory.

Each block is independent of previous execution, except
for the contents of main memory. In particular, no
registers or other internal bits in the processor are
shared across blocks.

Each block contains exactly one side-effecting store into
main memory, occurring at the end of the block.

Each transaction block is, from the standpoint of the
memory system, a mostly functional program consisting of
register loads and register-to-register instructions and
intermediate side effecting stores, and ending with a side
effecting store into main memory. The results of the side
effecting stores are temporarily stored, before being
applied to main memory.

Each of these blocks can be executed independently, on
several different processors. Moreover, there is no
interaction between blocks until the blocks are confirmed,
that is, until the results are validated. However, the
order in which the blocks are confirmed is critical. The
side effects must be executed in a well defined order: the
order specified in the program. Further, the execution of
a side effect can potentially modify a location which some
other block has already read.

These two tasks, confirming in order and enforcing the
dependencies of later blocks on earlier side effects,
require special hardware in accordance with the invention.

Each processor is free to execute its block at will. The
results of the side effects are stored in a cache
associated with the processor. As a result, this execution
cannot affect other processors. A block counter is
maintained, similar in purpose to a normal program
counter. The block counter's job is to keep track of which
block is next allowed to be confirmed. As the block counter
reaches each block in turn, the block compares its
referenced locations with the modified locations of
already confirmed blocks. If there is no match, this block
is confirmed. Confirming a block allows the results stored
in the cache to be sent to main memory and modify the
contents of main memory. Other blocks, not yet confirmed,
may already have referenced the location which has just
been modified. If they have, then the data upon which they
are computing is now invalid when they are confirmed.

But the program which they are executing has not 
modified main memory since the results are stored in a
cache. Thus the system is free to abandon the uncon- 
firmed executed block, and to attempt re-executing it a 
second time. This abandoning of an unconfirmed exe- 
cuted block is called aborting the block. 

In order to detect when a block needs to be
aborted, one needs to keep track of which memory
locations it has referenced. A dependency list is
therefore maintained, containing the addresses of all
locations in main memory which have been referenced
during execution of a block. When a processor executes a
load instruction, it adds the address being loaded to the
executing block's dependency list.

When a block is confirmed, the address it is side ef- 
fecting is broadcast to all other executing processors. 
Each processor checks its dependency list to determine 
if the block it is executing has referenced the modified 
location. If it has, the block is aborted and re-executed. 
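
The dependency-list bookkeeping can be sketched in software. This is only a minimal Python model of one processor's view, assuming main memory is a dictionary; the class name `BlockExecution` and its methods are hypothetical names for this illustration, and the hardware described here performs the same check with a cache rather than a set.

```python
class BlockExecution:
    """One processor's view of the block it is currently executing."""

    def __init__(self, block_id):
        self.block_id = block_id
        self.dependency_list = set()   # addresses loaded so far
        self.aborted = False

    def load(self, memory, address):
        # Every load records the address before returning the data.
        self.dependency_list.add(address)
        return memory[address]

    def observe_confirm(self, address):
        # Called when another block's confirmed side effect is broadcast.
        if address in self.dependency_list:
            self.aborted = True            # data read earlier is now invalid
            self.dependency_list.clear()   # re-execution starts a fresh list
        return self.aborted


memory = {"A": 1, "B": 2}
block2 = BlockExecution(2)
block3 = BlockExecution(3)
block2.load(memory, "B")
block3.load(memory, "A")

# An earlier block confirms a store to "A"; the address is broadcast.
memory["A"] = 5
statuses = [blk.observe_confirm("A") for blk in (block2, block3)]
```

Only the block that actually loaded the modified address is marked for re-execution; the other continues undisturbed.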

Eventually, a given block will be reached by the 
block counter, and it will be allowed to complete and
perform its side-effect. The block counter advances to 
the next block, and if the block is ready, that block is 
confirmed in the following cycle. If the confirmation of 
one block aborts the execution of successor blocks, 
however, the performance will be limited by the time it
takes the aborted block to re-execute its program. 

The performance improvement resulting from this 
technique depends on two factors: the average block 
length, and the frequency of aborts. The longer the 
block, the fewer confirms are necessary to execute a
given program. The more frequent the aborts, the more 
the block counter must await a sequential computation 
before confirming the next block. 

One can see now the influence of reducing the number
of side effects in the source program: as the number
of side effects is reduced, the length of the blocks can be
increased. Since the blocks are longer, fewer blocks
need be executed to perform a computation. Since
block confirmation is sequential, at a rate limited to a
maximum of one per clock cycle, this limits the
performance of the architecture. There is probably a
maximum desirable block length, since as the block size
increases, the length of the dependency list, and thus the
likelihood of a side effect from another block
influencing the computation, goes up.

In accordance with the invention, the parallel
processing system is receptive of a program and has at
least two processors connected in parallel to a shared
main memory. Each processor executes instructions of the
program, which includes side effecting instructions which
modify the contents of a location in main memory and
functional instructions which reference locations in main
memory. Means are provided for compiling a program into a
series of independent instruction blocks including
predominantly functional instructions and terminating in
a side-effecting instruction, and the processors execute
the blocks in parallel.



The system accordingly further comprises for each 
processor, means for maintaining a dependency list of 
all locations in main memory which have been refer- 
enced during the execution of a block therein, first 
cache means associated with each processor for tempo- 
rarily storing the location and contents of each location 
in main memory to be side effected by the execution of 
the block in the associated processor and means are 
provided for confirming each block in the order of the 
blocks as specified in the program wherein the validity 
of the contents of the first cache means are approved. 
The confirming means comprise means for comparing 
the dependency list of the block to be confirmed with 
the locations side effected by already confirmed blocks 
to detect a match and means for aborting the execution 
of the block to be confirmed if there is a match. 

In a preferred embodiment, the confirming means 
comprises a block counter and means for incrementing 
the block counter to indicate the next block to be con- 
firmed. The means for maintaining the dependency list 
comprises second cache means associated with each 
processor for temporarily storing each location and 
contents of main memory referenced by a block being 
executed in the processor. The confirming means fur- 
ther comprises means for updating the contents of the 
second cache means upon the aborting of a block and 
for effecting the reexecution of an aborted block with 
the side effected location from an already confirmed 
block. 

In one embodiment the first and second cache means 
each comprises a fully associative content addressable 
memory. In the alternative embodiment, the second 
cache means comprises an addressable random access 
memory for storing the dependency information of 
main memory locations and hash table means receptive 
of said addresses of the main memory locations for 
hashing them into a reduced number of address bits for
addressing the random access memory. 

In accordance with the present invention, the method 
for executing a program in at least two processors con- 
nected in parallel to a shared main memory, comprises 
compiling the program into a series of independent 
instruction blocks including predominantly functional 
instructions and terminating in a side-effecting instruc- 
tion, applying the blocks to the processors and execut- 
ing the blocks in parallel. 

The step of executing comprises maintaining a depen- 
dency list for each processor of all locations in main 
memory which have been referenced during the execu- 
tion of a block therein, for each processor temporarily 
storing the location and contents of each location in 
main memory to be side effected by the execution of the 
block therein and confirming each block in the order of 
the blocks as specified in the program to approve the 
validity of the temporarily stored contents by compar- 
ing the dependency list of the block to be confirmed 
with the locations which have been side effected by the 
already confirmed blocks to detect a match. 

When there is a match the execution of the block is 
aborted. 

The step of maintaining a dependency list preferably 
comprises temporarily storing, for each processor, each 
location and contents of main memory referenced by 
the block being executed therein and wherein the step 
of executing further comprises reexecuting an aborted 
block with the side effected location from an already 
confirmed block. 
In another embodiment, the step of temporarily storing
the referenced locations comprises hashing each
location address into a hashed address having a reduced
number of address bits and storing a logic one at the
hashed address in a random access memory.
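
The hashed embodiment can be sketched as a small bit array. This is a minimal Python illustration under stated assumptions: the class name `HashedDependencyRAM`, the 4-bit index width, and the XOR-folding hash are all choices made for the example, since the text does not specify a particular hash function or memory size.

```python
class HashedDependencyRAM:
    """Dependency list kept as one bit per hashed address."""

    def __init__(self, index_bits=4):
        self.index_bits = index_bits
        self.bits = [0] * (1 << index_bits)   # the random access memory

    def _hash(self, address):
        # Assumed hash: XOR successive index_bits-wide slices of the address.
        mask = (1 << self.index_bits) - 1
        hashed = 0
        while address:
            hashed ^= address & mask
            address >>= self.index_bits
        return hashed

    def record_reference(self, address):
        self.bits[self._hash(address)] = 1    # store a logic one

    def may_depend_on(self, address):
        return self.bits[self._hash(address)] == 1

    def clear(self):
        self.bits = [0] * len(self.bits)


ram = HashedDependencyRAM(index_bits=4)
ram.record_reference(0x1234)                  # hashes to index 4
hit = ram.may_depend_on(0x1234)
miss = ram.may_depend_on(0x0005)              # hashes to index 5
alias = ram.may_depend_on(0x0004)             # also hashes to index 4
```

Because several addresses share one bit, such a scheme can report a dependency that never existed (the `alias` case), which would cost an unnecessary abort, but it cannot miss a location that really was referenced.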

Moreover, the method includes selectively omitting 
referenced locations from the dependency list when 
desired. 

These and other features of the invention will be 
described in the following description with reference to
the drawings, wherein: 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a block diagram of a system according to 
the invention;

FIG. 2 shows an example of a transaction block ac- 
cording to the invention; 

FIG. 3 shows an example of several blocks executing 
in parallel; 

FIG. 4 is a call tree for guiding the block starting
process according to the invention; 

FIG. 5 is a state transition diagram of the state transi- 
tion logic for the dependency cache; 

FIG. 6 is an alternate embodiment for the dependency
cache shown in FIG. 1.

DETAILED DESCRIPTION OF THE 
INVENTION 

Referring now to FIG. 1, the system in accordance
with the present invention includes processors 1a, 1b, 1c
connected in parallel to common bus 5 to share main
memories 6 and 7. It should be noted that while three
processors are shown in parallel for the purpose of
example, the system according to the present invention
can utilize at least two processors in parallel up to an
indefinite number in accordance with the method and
system of the present invention. It should also be noted
that the present invention is capable of operating in
a quasi-parallel system, in which a single processor
executes two or more programs in an interleaved manner;
such a system is also contemplated as being within
the scope of the present invention.

Connected between each processor and the bus 5 are
a confirm cache 2a, 2b, 2c and a dependency cache 3a,
3b, 3c which has associated state transition logic 4a, 4b,
4c connected thereto. Processors 1a, 1b and 1c also have
connections to a block counter 8, which will be
described hereinafter.

The system also includes compiler 9 connected to the 
bus in a conventional manner for transforming an input
program into sequences of compiled primitive instruc- 
tions called transaction or instruction blocks. 

Each block, as shown in FIG. 2, has a mostly
functional portion and terminates in a side effecting store
instruction. As noted, the functional instructions
include those which reference main memory to carry out
the operations thereof or operate only on internal
processor registers, while the side effecting instructions
are those which require modification of the contents of a
location in main memory.

The mostly functional portion can have no side ef- 
fecting instructions or one or more side effecting in- 
structions, as long as at least a majority of the instruc- 
tions thereof are functional. 

In accordance with the method of the present
invention, after the compiler 9 has broken up the program
into the blocks, the blocks are applied to the processors
1a-1c for execution in parallel. Each reference to main
memory is stored in the dependency cache 3a-3c
whereas each side effecting instruction is carried out
and written only into the confirm cache 2a-2c.

Thus each processor is free to execute its block at will 
since all referenced locations and all locations to be 
modified are stored in the caches 2a-2c and 3a-3c with- 
out the contents of main memory 6, 7 being affected. 

In order to validate the data for being written into 
main memory to modify the contents thereof, each 
block must first be confirmed. The blocks are confirmed 
in the order specified in the overall program and this is 
carried out by the block counter 8 which specifies the 
order in which the processors 1a-1c will be confirmed.

The dependency caches 3a-3c, by keeping track of all
memory locations referenced by the processor during
the execution of a block, thereby maintain a dependency
list of the data which is depended on by the block
in order to achieve the execution thereof. After the first
block is confirmed, all of the side effected locations
stored in the confirm cache corresponding thereto are
placed out on the bus and received by the dependency
caches of the other blocks to see if any of these side
effected locations were depended upon by the blocks
during execution. If this is the case and there is a match,
then the executed unconfirmed block must be aborted
and reexecuted with the updated side effected location
data.

In a preferred embodiment of the present invention, 
the confirm and dependency caches are fully associative 
content addressable memories which enable the com- 
parison to be made without any additional structure. 

FIG. 3 shows an example of several blocks executing 
in parallel. As can be seen from this example, block 3 
will abort when the first block is confirmed since block 
3 references the variable A which has been side effected 
in block 1. 
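
The confirm-and-abort cycle can be traced with a small sequential simulation. This is a hedged Python sketch, not the hardware: the functions `run_block` and `execute`, the three example blocks, and the initial memory contents are hypothetical, chosen so that block 3 reads a variable A that block 1 side effects, as in the situation just described. The actual contents of FIG. 3 are not reproduced here.

```python
def run_block(block, memory):
    """Execute one block speculatively; main memory is read, never written."""
    dependencies, confirm_cache = set(), {}

    def load(address):
        if address in confirm_cache:        # a block sees its own pending stores
            return confirm_cache[address]
        dependencies.add(address)           # remember what was read from memory
        return memory[address]

    def store(address, value):
        confirm_cache[address] = value      # held locally until confirmation

    block(load, store)
    return dependencies, confirm_cache


def execute(blocks, memory):
    """Confirm blocks in program order; return the numbers of aborted blocks."""
    results = [run_block(b, memory) for b in blocks]  # stands in for parallel runs
    stale = [False] * len(blocks)
    aborted = []
    for counter, block in enumerate(blocks):          # the block counter
        if stale[counter]:
            results[counter] = run_block(block, memory)  # re-execute on fresh data
            aborted.append(counter + 1)
        _, confirm_cache = results[counter]
        memory.update(confirm_cache)                  # side effects reach memory
        for later in range(counter + 1, len(blocks)): # broadcast modified addresses
            if not results[later][0].isdisjoint(confirm_cache):
                stale[later] = True
    return aborted


def block1(load, store):
    store("A", load("A") + 1)


def block2(load, store):
    store("B", load("B") * 2)


def block3(load, store):
    store("C", load("A") * 10)


memory = {"A": 1, "B": 2, "C": 0}
aborted = execute([block1, block2, block3], memory)
```

In this run block 2 confirms without disturbance, block 3 is re-executed once, and the final memory contents are the same as a purely sequential execution would produce.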

Since the system is simulating the behavior of a uni- 
processor by using parallel hardware, there is a defined 
order in which the blocks must be confirmed. Execution 
of a given block should start early enough so that its 
results are ready by the time that block is reached by the 
block counter 8. 

FIG. 4 shows a call tree of a process which provides 
heuristic data to guide the block starting process. Here, 
it is desirable to start execution of blocks 2 and 3, while 
the block counter is at block 1, because these are the 
next sequential blocks. One would also want to start 
execution of block 6, because it is unlikely that confirm- 
ing blocks 1-5 will side effect data which block 6 de- 
pends upon. 

But starting execution of a block earlier is potentially 
bad, because it will reference data which is older than 
necessary and because the processor being used to exe- 
cute it may be needed to execute a block which must be 
confirmed sooner. 

These observations lead to the following block execu- 
tion strategy: 

1. Start executing blocks ahead of this block, or 
downward and to the left in the call tree. 

2. Start executing blocks which branch downward to 
the right in the call tree. 

3. If additional processors are required for (1) or
(2), abort execution of the furthest right branches of
the call tree, and reclaim their processors.

Heuristic (1) is very similar to instruction pre-fetching
on conventional pipelined architectures. Since we
are executing this block, we will next need to execute its
successor.
Heuristic (2) depends on the fact that evaluations
from the same level of the call tree are often
independent, and can often be executed in parallel
without conflicting side effects.

Heuristic (3) is a simple consequence of the fact that
we must finish execution of the bottom left portion of 
the call tree prior to executing the right portion of the 
call tree. Thus, we are better off using the processors 
executing blocks to the right for more immediate needs. 

These heuristics give a very good handle on the
difficult problem of resource allocation in the
multiprocessor environment.

The presence of conditionals in the language,
however, makes the tidy resource allocation scheme
described above much more difficult. Since one does not
know which of the two paths a conditional will follow
until the data computation is confirmed, one must start
blocks executing down both branches, or risk delaying
execution waiting for a block on the path not followed.
Fortunately, one is free to start executing down both
paths, since until a block is confirmed it does not side
effect main memory. The blocks started along the
unfollowed branch of a conditional are simply aborted
without confirming.

Allocation of processor resources in the presence of
conditionals is substantially more difficult, since one
cannot, except heuristically, predict which of two paths
is more worthwhile for expending processor resources.

Loops are a special case of conditional execution. As
with most parallel processors, there is an advantage in
knowing the high level semantics of a loop construct.
Rather than attempting to execute repetitive copies of
the same block, incrementing the loop index as a setup
for the next pass, it is far preferable to know that a loop
is required, and that the loop is, for example, indexing
over a set of array values from 0 to 100. One can then
execute 101 copies of the loop block, each with a unique
value of the loop index. The unpalatable alternative is to
rely on the side effect detection hardware to notice that
the loop index is changed in each block iteration,
resulting in an abort of the next block on every loop
iteration.
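
The preferred treatment of a loop can be sketched as follows. This is a minimal Python illustration only: the `loop_body` function, the doubling computation, and the use of a thread pool in place of separate processors are assumptions made for the example. Each copy of the body receives its own index value up front, so no copy ever reads a shared loop index.

```python
from concurrent.futures import ThreadPoolExecutor


def loop_body(index, source):
    # Functional portion: only reads. The single side effect is returned
    # as an (address, value) pair instead of being written directly.
    return index, source[index] * 2


source = list(range(101))

# Start 101 copies of the body, one per index value from 0 to 100.
with ThreadPoolExecutor(max_workers=8) as pool:
    effects = list(pool.map(lambda i: loop_body(i, source), range(0, 101)))

# Apply the side effects in program order, as confirmation would.
result = [None] * 101
for address, value in effects:
    result[address] = value
```

Since the copies share no modified location, none of them would need to be aborted under the dependency check described earlier.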

Unlike most of the parallel proposals for loop
execution, however, the systems herein can also handle the
hard case where the execution of the loop body modifies
the loop control parameters, perhaps in a very indirect
or conditional way.

The way the system deals with both conditional
execution and with loop iteration involves prediction of the
future value of a variable. In a sense, the dependency list
maintenance already described is a form of value
prediction: one predicts that the value one is depending
upon will not change by the time the executing block
confirms.

One can extend this idea to one which predicts the
future value of variables likely to be modified from their
current value prior to executing the block. One application
of this technique is to the problem of loop iterations.
Instead of relying on the current value of the loop
index as the predicted value for the next iteration, one
desires to predict the future value, predicting a different
future value for each parallel body block started.
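
As a rough illustration of the loop case, the following Python sketch starts one body block per predicted index value and then confirms the blocks in program order. The names `run_loop`, `body` and the `memory` dictionary are hypothetical and do not appear in the text; in the described system the blocks would run on separate processors and the check would be made by the dependency cache.

```python
def run_loop(body, first, last):
    """Speculatively run every body block, then confirm in program order."""
    memory = {"index": first, "total": 0}
    # One block per predicted index value; these could run in parallel.
    speculative = [(i, body(i)) for i in range(first, last + 1)]
    for predicted, result in speculative:
        if memory["index"] != predicted:
            # Misprediction: the block would be re-executed with the real index.
            result = body(memory["index"])
        memory["total"] += result
        memory["index"] += 1
    return memory

final = run_loop(lambda i: i, 0, 100)  # 101 body blocks, indices 0 to 100
```

Because every block receives its own predicted index, no block has to wait for the previous iteration to update the index.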

Similarly, one can implement conditionals by predicting
the value of the predicate result slot, predicting that
it is true in one arm of the conditional, and that it is false
in the other. The predicate is then evaluated, and when
it confirms, it side effects the predicate result slot, aborting
one of the two branches. When aborting the execution
of conditional blocks, we must abandon execution
of the block rather than update the variable and retry, as
we would in normal block execution.
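
A minimal sketch of this conditional scheme, under the assumption that both arms are started speculatively and that writing the predicate result slot abandons the arm whose prediction was wrong. The function and field names are illustrative only.

```python
def run_conditional(predicate, then_arm, else_arm):
    """Run both arms speculatively; keep the one whose prediction holds."""
    arms = [
        {"predicted": True, "result": then_arm(), "abandoned": False},
        {"predicted": False, "result": else_arm(), "abandoned": False},
    ]
    # The predicate block confirms and side effects the predicate result slot.
    slot = bool(predicate())
    for arm in arms:
        if arm["predicted"] != slot:
            arm["abandoned"] = True  # abandoned for good, never retried
    survivors = [arm["result"] for arm in arms if not arm["abandoned"]]
    return survivors[0]
```

Exactly one arm survives, since the two predictions are complementary.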

The shared memory multiprocessor of the present
invention, with each processor 1a-1c containing two
fully associative caches, effectively implements the
above scheme. The first of the caches, the dependency
cache 3a-3c, usually holds read data copies from main
memory 6, 7. This cache also watches bus transactions in
a way similar to the snoopy cache coherency protocol
as described in Goodman, "Using Cache Memory to
Reduce Processor Memory Traffic," The 10th Annual
International Symposium on Computer Architecture
pp. 123-131 (1983). The second cache, the confirm
cache 2a-2c, holds only side effected data written by the
associated processor, but not yet confirmed.

Since the processors share a common memory bus 5, 
each cache can see all writes to main memory. The 
dependency cache acts as a normal data read cache, but 
also implements the dependency list and predicted 
value features. The confirm cache is used to allow pro- 
cessors to locally side effect their version of main mem- 
ory, without those changes being visible to other pro- 
cessors prior to their block confirmation. 

Each main memory location, therefore, can have two 
entries in a processor's caches, one in the dependency 
cache, and one in the confirm cache. This is because a
record must be maintained, in the dependency cache, of
what data the processor's current computation has
depended upon, while also allowing the processor to
tentatively update its version of memory, in the confirm 
cache. For processor reads, priority is always given to 
the contents of the confirm cache if there is an entry in 
both caches, because one wants the processor to see its 
own modifications to memory, prior to them being 
confirmed. 
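
The read priority described above can be sketched as follows. This is a simplified model in which each cache is a plain dictionary keyed by address; the state bookkeeping of the dependency cache is left out, and the function name is an assumption of the sketch.

```python
def processor_read(address, confirm_cache, dependency_cache, main_memory):
    """Return the value the processor sees for `address`."""
    if address in confirm_cache:
        # The processor's own unconfirmed write takes priority.
        return confirm_cache[address]
    if address not in dependency_cache:
        # Miss: fetch a copy from main memory over the bus.
        dependency_cache[address] = main_memory[address]
    return dependency_cache[address]
```

A tentative write placed in `confirm_cache` is therefore seen by the writing processor while main memory stays unchanged.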

The dependency cache performs three functions: 

1. It acts as a normal read data cache for main mem- 
ory data. 

2. It stores block dependency information. 

3. It holds predicted values of variables associated 
with this block. 

FIG. 5 shows a state transition diagram for the depen- 
dency cache as carried out by the state transition logic 
4a, 4b, 4c. There are six states shown in the cache state 
diagram: INVALID, VALID, DEPENDS, PRE- 
DICT/VALID, PREDICT/INVALID and PRE- 
DICT/ABANDON. 

INVALID is the state associated with an empty 
cache line. 

VALID is the state representing that the cache holds
a correct copy of a main memory datum.

DEPENDS indicates that the cache holds a correct
datum, and that the ongoing block computation is depending
on the continued correctness of this value.

PREDICT/VALID indicates that the block is pre- 
dicting that, when it is ready to confirm, the value held 
in this cache location will continue to be equal to the 
value in main memory. This state indicates that the held 
value is now equal to main memory. This state differs 
from DEPENDS only in the action taken when a bus 
cycle modifies this memory location. 

PREDICT/INVALID indicates that the block is 
predicting that, when it is ready to confirm, the value 
held in this cache location will be equal to the value in 
main memory. The contents of main memory currently
differ from the value held in the cache.

PREDICT/ABANDON indicates that the block
execution is conditional on the eventual contents of
memory being side effected to be equal to that held in
the cache. If it is side effected to some other value, the
block will be aborted and not restarted (abandoned).

At the start of each block execution, the dependency 
cache is initialized. This results in setting all locations to
either the INVALID or VALID states. Cache entries 
which are currently VALID, DEPENDS, or PRE- 
DICT/VALID are forced to the VALID state. All 
others are forced to the INVALID state. 

Each time the processor loads a data item into a register
from the dependency cache, the state of the cache line is
potentially modified. If the cache entry for the location
is in the INVALID state, a memory bus read request is
performed, a new cache line allocated, the data stored
in the dependency cache, and the line state is set to
DEPENDS. If the cache line is already VALID, then the
state is simply set to DEPENDS. In all other states, the
data is provided from the cache, and the state of the
entry is unmodified.
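
The load transitions just described might be modelled as below. The six state names come from the text; the `State` enum, the `load` function and the representation of the cache as a dictionary from address to `[state, value]` are assumptions made for illustration.

```python
from enum import Enum

class State(Enum):
    INVALID = "INVALID"
    VALID = "VALID"
    DEPENDS = "DEPENDS"
    PREDICT_VALID = "PREDICT/VALID"
    PREDICT_INVALID = "PREDICT/INVALID"
    PREDICT_ABANDON = "PREDICT/ABANDON"

def load(cache, address, main_memory):
    """Processor load; `cache` maps address -> [state, value]."""
    state, value = cache.get(address, (State.INVALID, None))
    if state is State.INVALID:
        value = main_memory[address]  # bus read on a miss
        cache[address] = [State.DEPENDS, value]
    elif state is State.VALID:
        cache[address] = [State.DEPENDS, value]
    # DEPENDS and the three PREDICT states: serve from cache, state unchanged.
    return value
```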

A cache miss is allowed to replace any entry in the
cache which is in the INVALID or VALID states, but
must not modify other entries. We require this because
the dependencies are being maintained by the entries in
the other states. This implies that the cache must be fully
associative, since otherwise a conflict over the limited
number of sets available for a given location would make
it impossible to store the dependency information.

A Predict-Value request from the processor stores a 
predicted future value for a memory location into the 
dependency cache. The cache initiates a main memory
read cycle for the same location. If the location contains 
the predicted value, the cache state is set to PRE- 
DICT/VALID. If the values disagree, the cache state is 
set to PREDICT/INVALID. 

A Predict-Abandon request from the processor stores
a predicted future value of a predicate into the dependency
cache, and forces the cache line to the
PREDICT/ABANDON state.

A bus write cycle, taken by some other processor,
can potentially modify a location in main memory
which is held in the dependency cache. Since there is a
shared memory bus connecting all of the caches, each
cache can monitor all of these writes. For INVALID
cache entries, nothing is done. For VALID entries, the
newly written data is copied into the cache, and the
cache maintains validity.

For DEPENDS entries, the cache is updated, and if
the new contents differ from the old contents, the running
block on the cache's associated processor is
aborted and then restarted. This is the key mechanism
for enforcing inter-block sequential consistency.

Bus writes to locations which are PREDICT/VALID
or PREDICT/INVALID are compared with
the contents of the dependency cache. If the values agree,
the cache state is set to PREDICT/VALID. If they
disagree, the state is set to PREDICT/INVALID.

Bus writes to locations which are PREDICT/ABANDON
are compared with the contents of the dependency
cache. If they agree, no change occurs. If they
disagree, the associated processor is aborted, and the
block currently executing is permanently abandoned.
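
The handling of an observed bus write can be summarized as a small transition function. This is a sketch only: states are represented as strings, and the action labels `"abort_and_restart"` and `"abandon"` are hypothetical names for the two processor signals described in the text.

```python
def snoop_write(state, held, written):
    """Return (new_state, new_held, action) for a bus write by another processor."""
    if state == "INVALID":
        return state, held, None
    if state == "VALID":
        return state, written, None  # copy the new data, stay valid
    if state == "DEPENDS":
        # The cache is updated; a changed value aborts and restarts the block.
        action = "abort_and_restart" if written != held else None
        return state, written, action
    if state in ("PREDICT/VALID", "PREDICT/INVALID"):
        # The predicted value is kept; only the agreement flag changes.
        new_state = "PREDICT/VALID" if written == held else "PREDICT/INVALID"
        return new_state, held, None
    if state == "PREDICT/ABANDON":
        action = "abandon" if written != held else None
        return state, held, action
    raise ValueError(f"unknown state: {state}")
```

When a DEPENDS entry triggers a restart, the block initialization described earlier would then force the entry back to VALID.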

The confirm cache preferably comprises a fully associative
cache which holds only side-effected data.
When the block is initialized, the confirm cache is emptied
by invalidating all its entries. When the processor
performs a side-effecting write, the write data is held in
the confirm cache, but not written immediately into
main memory. There, it is visible to the processor which
performed the side-effect, but to no other processors.
The confirm cache has priority over the dependency
cache in providing read data to the processor. If both
hold the contents of a location, the data is provided by
the confirm cache. This allows a processor to modify a
location, and to have those modifications visible during
further computation within the block.

When the block counter 8 reaches a block associated 
with a particular processor, and the block has com- 
pleted execution, it is time to confirm that block. One 
final memory consistency check is performed. The load 
dependencies are necessarily satisfied, because if they 
were not, then the block would have been aborted. But 
the predicted-value dependencies may not be satisfied. 
These dependencies are checked by testing to see if 
there are any entries in the dependency cache which are 
in the PREDICT/INVALID state. Any entry in that 
state indicates a memory location whose contents were
predicted to be one value, and whose actual memory 
contents are another. This block must be re-executed 
with the now-current value. 

After performing this final consistency check, the 
side effects associated with this block may be per- 
formed. This operation consists of sweeping the con- 
firm cache, and writing back to memory any modified 
entries. The write-back of these entries may, of course, 
force other processors to abort partially or completely 
executed blocks, if they have depended on the old val- 
ues of these locations. 
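
The confirmation step can be sketched as a check followed by a write-back. The dictionaries and the name `try_confirm` are assumptions of this sketch; the effect of each write on other processors' caches is not modelled.

```python
def try_confirm(dependency_states, confirm_cache, main_memory):
    """Write back the block's side effects if its predictions held.

    dependency_states maps address -> state name; confirm_cache maps
    address -> tentatively written value.
    """
    if any(s == "PREDICT/INVALID" for s in dependency_states.values()):
        return False  # a predicted value was wrong: re-execute the block
    for address, value in confirm_cache.items():
        main_memory[address] = value  # each write is visible on the bus
    confirm_cache.clear()
    return True
```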

One consequence of using this cache strategy for 
implementing the dependency checking is that each 
block can perform multiple side-effects, and those side- 
effects will be visible only to the executing block until 
the block is confirmed. This enables each block to have 
more than just the terminal side-effect and allows the 
compiler flexibility in choosing the optimal block size 
independent of how many side-effects are present 
within the block. 

Another important feature of using the cache as a 
dependency tracking mechanism is that when a transac- 
tion aborts, the valid entries in the cache remain, so that 
the re-execution of the block will likely achieve an 
almost 100% cache hit rate, reducing memory bus traf- 
fic and improving processor speed. 

The Lisp operation cons, although it performs a write 
into main memory, is not a side effecting instruction, 
since it is guaranteed that there is no other pointer to the 
location being written, and thus no other block can be 
affected by its change. In the block execution architec- 
ture, this can be implemented by providing each proces- 
sor with an independent free pointer. Consing and other 
local memory allocating writes done within a block are 
performed using the free pointer. The value of the 
pointer is saved when the block is entered, and an abort
resets the pointer to its entry value, automatically reclaiming
the allocated storage. During the confirm, the
free pointer's value is updated. All of these free pointer 
manipulations happen automatically if it is treated sim- 
ply as another variable whose value can be loaded and 
side effected. Writes of data into newly consed locations 
need not be confirmed. They can be best handled with 
a write-through technique in the dependency cache. It 
is important that the dependency cache not contain stale 
copies of data which has been written with a cons write 
operation. 
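
A minimal sketch of the per-processor free pointer, assuming a simple bump allocator over a list. The class and method names are illustrative and not taken from the text.

```python
class BlockAllocator:
    """Bump allocator whose free pointer is restored when a block aborts."""

    def __init__(self, heap_size):
        self.heap = [None] * heap_size
        self.free = 0        # this processor's free pointer
        self.entry_free = 0  # value saved at block entry

    def enter_block(self):
        self.entry_free = self.free

    def cons(self, car, cdr):
        address = self.free
        self.heap[address] = (car, cdr)  # no other pointer to this cell exists yet
        self.free += 1
        return address

    def abort(self):
        # Resetting the pointer reclaims everything consed in the block.
        self.free = self.entry_free
```

After an abort, the re-executed block simply overwrites the cells the aborted attempt had allocated.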

The addition of one primitive to the processor instruction
set allows the re-introduction of explicit programmer
visible parallelism into the language supported by
this architecture. The primitive performs a load operation
without adding the location being loaded to the
dependency list. In the dependency cache, it returns valid
data if present, and reads memory if necessary to produce
valid data. It never changes the state of the cache
to DEPENDS.

With this primitive, it is possible to express algo- 
rithms which compute results based on potentially stale 
data. In those cases where this is acceptable from an 
algorithmic point of view, parallel execution with
no inter-block dependencies can be supported. One 
example is the Gauss-Jacobi parallel iterative matrix 
solution technique, as contrasted with the sequential 
Gauss-Seidel technique. 
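
To illustrate why the Gauss-Jacobi technique tolerates stale data, the sketch below updates every unknown from the previous iterate only, so the per-component updates do not depend on one another within a sweep. The function names and the small test system are chosen for illustration.

```python
def jacobi_step(a, b, x):
    """One Jacobi sweep: every component reads only the old vector x."""
    n = len(b)
    return [
        (b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
        for i in range(n)
    ]

def jacobi(a, b, iterations=50):
    """Iterate from the zero vector; suited to diagonally dominant systems."""
    x = [0.0] * len(b)
    for _ in range(iterations):
        x = jacobi_step(a, b, x)
    return x

# Solves 4x + y = 6, x + 3y = 7, whose exact solution is x = 1, y = 2.
solution = jacobi([[4.0, 1.0], [1.0, 3.0]], [6.0, 7.0])
```

Gauss-Seidel, by contrast, uses each newly computed component immediately, which makes the updates within a sweep sequentially dependent.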

In an alternative hardware implementation shown in
FIG. 6, dependencies are stored using a hashing technique
rather than a fully associative cache. This implementation
may be attractive because the RAM used is
less expensive than the fully associative storage used in
the cache based design.

The circuit includes processors 11a, 11b, 11c connected
through the confirm cache 12a, 12b, 12c to the
shared system bus 16 with memories 15a, 15b. The circuit
also includes a hash function box 13a, 13b, 13c
which receives the main memory address, which is for
example thirty-two bits, and hashes it into a reproducible
hashed address of for example twelve bits. The twelve bit
address is used to address a 4096 x 1 RAM 14a, 14b, 14c.
At the start of an instruction block execution in processor
11a, the Clear signal is asserted, and the RAM 14a is
cleared, either by writing each location to a logic zero,
or through special means for simultaneously clearing all
locations. Each time processor 11a performs a memory
load operation it asserts the write signal, and writes a
logic one into RAM 14a at the location addressed by
the hash function. Each time processor 11b terminates a
block by performing its last side effect, the memory
store from processor 11b to memory 15a or 15b is visible
to processor 11a. Hash box 13a calculates the hashed
address, and examines the bit in RAM 14a corresponding
to the hashed address being written by processor
11b. If this bit is a logic one, then the block just confirmed
on processor 11b has potentially (but not necessarily)
invalidated the computation being performed by
processor 11a. Processor 11a must abort, clear RAM
14a, and retry executing the block.
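
The hashed scheme behaves like a one-bit-per-entry filter: it never misses a real dependency but may report a false one when two addresses collide. The sketch below assumes a 12-bit hashed address as in the example; the particular folding hash and all names are hypothetical, since the text does not specify the hash function.

```python
class HashedDependencyFilter:
    """Models the per-processor 4096 x 1 RAM addressed by a hashed address."""

    def __init__(self, bits=12):
        self.mask = (1 << bits) - 1
        self.ram = bytearray(1 << bits)  # one flag per hashed address

    def _hash(self, address):
        # Hypothetical hash: fold a 32-bit address down to `bits` bits.
        address &= 0xFFFFFFFF
        return (address ^ (address >> 12) ^ (address >> 24)) & self.mask

    def clear(self):
        # Corresponds to asserting Clear at the start of a block.
        self.ram = bytearray(len(self.ram))

    def record_load(self, address):
        self.ram[self._hash(address)] = 1

    def must_abort(self, written_address):
        # True if another processor's store may have hit a loaded location.
        return self.ram[self._hash(written_address)] == 1
```

With this hash, addresses 0x1000 and 0x1 collide, so a store to 0x1 after a load of 0x1000 forces an unnecessary but harmless abort.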

What is claimed is: 

1. In a parallel processing system receptive of a program
and having at least two processors connected in
parallel to a shared main memory, wherein each processor
executes instructions of a program which includes
side effecting instructions which modify the contents of
a location in main memory and functional instructions
which reference locations in main memory or operate
only on internal processor registers, the improvement
comprising: means for compiling a program into a series
of independent instruction blocks including predominantly
functional instructions and terminating in a side
effecting instruction; means for applying compiled
blocks to the at least two processors for executing the
blocks in parallel; and means for validating data to be
stored in a location of main memory from each parallel
processed block relative to locations referenced during
execution by the other of the at least two processors,
wherein the validating means comprises, for each processor,
means for maintaining a dependency list of all
locations in main memory which have been referenced
during the execution of a block therein.



2. The system according to claim 1, wherein the vali- 
dating means comprises first cache means associated 
with each processor for temporarily storing the location 
and contents of each location in main memory to be side 
effected by the execution of the block therein and means 
for confirming each block in the order of the blocks as 
specified in the program wherein the validity of the 
contents of the first cache means are approved compris- 
ing means for comparing the dependency list of the 
block to be confirmed with the locations side effected 
by already confirmed blocks to detect a match and 
means for aborting the execution of the block to be 
confirmed if there is a match. 

3. The system according to claim 2, wherein the con- 
firming means comprises a block counter and means for 
incrementing the block counter to indicate the next 
block to be confirmed. 

4. The system according to claim 2, wherein the 
means for maintaining a dependency list comprises sec- 
ond cache means associated with each processor for 
temporarily storing each location and contents of main 
memory referenced by a block being executed in the 
processor and wherein the confirming means comprises 
means for updating the contents of the second cache 
means upon the aborting of a block and for effecting the 
reexecution of an aborted block with the side effected 
location from an already confirmed block. 

5. The system according to claim 4, wherein the first 
and second cache means each comprises a fully associa- 
tive content addressable memory. 

6. The system according to claim 4, wherein the sec- 
ond cache means comprises an addressable random 
access memory for storing dependency information of 
main memory locations and hash table means receptive 
of said addresses of the main memory locations for 
hashing them into a reduced number of address bits for
addressing the random access memory. 

7. The system according to claim 1, wherein the 
means for maintaining a dependency list includes means 
for selectively referencing locations in main memory 
without adding same to the dependency list. 

8. In a method for executing a program in at least two 
processors connected in parallel to a shared main mem- 
ory, wherein the program includes side effecting in- 
structions which modify the contents of a location in 
main memory and functional instructions which refer- 
ence locations in main memory, the improvement com- 
prising: compiling the program into a series of indepen- 
dent instruction blocks including predominantly func- 
tional instructions and terminating in a side effecting 
instruction, applying compiled blocks to the processors, 
executing the blocks in parallel and validating data to be 
stored in a location of main memory from each parallel 
processed block relative to locations referenced during 
execution by the other of the at least two processors by 
maintaining a dependency list for each processor of all 
locations in main memory which have been referenced 
during the execution of a block therein. 

9. The method according to claim 8, wherein the step 
of validating further comprises for each processor tem- 
porarily storing the location and contents of each loca- 
tion in main memory to be side effected by the execu- 
tion of the block therein and confirming each block in 
the order of the blocks as specified in the program to 
approve the validity of the stored contents by compar- 
ing the dependency list of the block to be confirmed 
with the locations which have been side effected by the 
already confirmed blocks to detect a match. 
10. The method according to claim 9, wherein the
step of validating further comprises aborting the execu- 
tion of a block when there is a match. 

11. The method according to claim 10, wherein the 
step of maintaining a dependency list comprises tempo- 
rarily storing for each processor each location and con- 
tents of main memory referenced by the block being 
executed therein and wherein the step of validating 
further comprises reexecuting an aborted block with the
side effected location from an already confirmed block. 



12. The method according to claim 11, wherein the 
step of temporarily storing the referenced locations 
comprises hashing each location address into a hashed 
address having a reduced number of address bits and 
storing dependency information at the hashed address 
in a random access memory. 

13. The method according to claim 8, wherein the
step of maintaining a dependency list includes selec- 
tively omitting referenced locations from the depen- 
dency list. 
SYSTEM AND METHOD FOR PARALLEL
PROCESSING WITH MOSTLY FUNCTIONAL
LANGUAGES

BACKGROUND OF THE INVENTION

The present invention relates to a computer system
and a method for parallel processing program instructions.

Many parallel architectures start by defining the programming
model they support as strictly functional.
Computer architects find that the strictly functional
approach makes their task easier, because the order in
which computation is performed, once the basic dependencies
are satisfied, is irrelevant. Building highly parallel
machines which execute strictly functional languages
then becomes relatively straightforward.
Strictly functional languages certainly have advantages,
including the relative ease of proving their correctness
and the ease of thinking and reasoning about certain
simple kinds of programs, as is disclosed in Backus,
"Can Programming be Liberated from the von Neumann
Style? A Functional Style and its Algebra of
Programs" Communications of the ACM Vol. 21 no. 8
pp. 613-641 (August 1978).

Unfortunately there is a class of algorithms for
which the strictly functional programming style appears
to be inadequate. The problems are rooted in the
difficulty in interfacing a strictly functional program to
a non-functional outside environment. It is impossible to
program a task such as an operating system, for instance,
in the unmodified strictly functional style.
Clever modifications of the strict functional style have
been proposed which cure this defect. For example,
McCarthy's AMB operator, which returns whichever
of two inputs are ready first, appears to be adequate to
solve this and similar problems. Several equivalent operators
have been proposed in Peter Henderson, "Is It
Reasonable to Implement a Complete Programming
System in a Purely Functional Style?" Technical memo
PMM/94, The University of Newcastle upon Tyne
Computing Laboratory (December 1980).

However, the introduction of such non-functional
operators into the programming language destroys
many of its advantages. Architects can no longer ignore
the order of execution of functions. Proving program
correctness again becomes a very much more difficult
task. Users also find the programming with such operators
is tedious and error prone. In particular, as shown
by Agha in "Actors: A Model of Concurrent Computation
in Distributed Systems" M.I.T. Artificial Intelligence
Laboratory Technical Memo 844, (1985), the
presence of such an operator allows the definition of a
new data type, the cell, which can be side effected in
just as normal memory locations in non-functional languages.
Thus the introduction of the AMB operator,
while adequate to solve the lack of side effect problem,
re-introduces many of the problems which the functional
language advocates are attempting to avoid.

Therefore conventional thought indicated the choice
of either defining weak systems, such as purely functional
languages, about which one can prove theorems,
or defining strong systems which one can prove little
about.

Much more serious than any of the theoretical difficulties
of strictly functional languages, is the difficulty
programmers face in implementing certain types of
programs. It is often more natural to think about an
algorithm in terms of the internal state of a set of objects
as, for example, discussed in chapter 3 of Abelson and
Sussman, Structure and Interpretation of Computer
Programs, M.I.T. Press, Cambridge, (1985). This object
oriented viewpoint in programming language design is
incompatible with the strictly functional approach to
computation. The same algorithms can be implemented
in either style, and both systems are surely universal,
but the fact that programmers find one representation
for programs easier to think about, easier to design,
and easier to debug is, in itself, a powerful motivation
for a programming language to provide that representation.

If this were merely a representation issue, there
would be hope that suitable compiler technology could
eventually lead to a programming language which provides
support for side effects, but which compiles into
purely functional operations. As has been discussed
above, however, in the general case this is not possible.
Side effects cause problems. Architects find it difficult
to build parallel architectures for supporting them.
Verification software finds such programs difficult to
reason about. And programming styles with an abundance
of side effects are difficult to understand, modify
and maintain.

In accordance with the present invention, instead of
completely eliminating side effects, it is proposed that
their use be curtailed. Most side effects in
conventional programming languages like Fortran are
gratuitous. They are not solving difficult multi-tasking
communications problems. Nor are they improving the
large scale modularity or clarity of the program. Instead,
they are used for trivial purposes such as updating
a loop index variable.

Elimination of these unnecessary side effects can
make code more readable, more maintainable, and as set
forth hereinafter in accordance with the invention,
faster to execute.

Many of these same issues motivate Halstead's Multi-Lisp,
a parallel variant of Scheme as disclosed in "MultiLisp:
A Language for Concurrent Symbolic Computation"
ACM Transactions on Programming Languages
and Systems, (1985). The approach Multi-Lisp takes to
providing access to parallel computation is the addition
of programmer visible primitives to the language. The
three primitives which distinguish it from conventional
Lisp are:

The future primitive which allows an encapsulated
value to be evaluated, while simultaneously returning to
its caller a promise for that value. The caller can continue
to compute with this returned object, incorporating
it into data structures, and passing it to other functions.
Only when the value is examined is it necessary
for the parallel computation of its value to complete;

the pcall primitive which allows the parallel evaluation
of arguments to a function. Halstead teaches that
pcall can be implemented as a simple macro which
expands into a sequence of futures. It can be thought of
as syntactic sugar for a stylized use of futures; and

the delay primitive which allows a programmer to
specify that a particular computation be delayed until
the result is needed. Similar in some respects to a future,
delay returns a promise to compute the value, but does
not begin computation of a value until the result is
needed. Thus the delay primitive is not a source of
parallelism in the language. It is a way of providing lazy
evaluation semantics to the programmer.

Both delay and future result in an order of execution
different from applicative order computation. In the
absence of side effects, both will result in the same value
as the equivalent program without the primitives, since
they affect only the order in which the computation is
performed. This is not strictly true for delay since its
careful use can allow otherwise non-terminating programs
to return a value.

In the presence of side effects, it is difficult to predict
the behavior of a future, since its value may be computed
in parallel with other computations.

While the value of a delay is deterministic, since it 
does not introduce additional parallelism, its time of 
computation is dependent on when its value is first 
examined. This can be very non-intuitive and difficult to
think about while writing programs. 

Halstead implements the future and delay primitives
by returning to the caller an object of data-type future.
This object contains a pointer to a slot, which will eventually
contain the value returned by the promised computation.
The future may be stored in data structures
and passed as a value. Computations which attempt to
reference the value of the future prior to its computation
are suspended. When the promised computation
completes, the returned value is stored in the specified
slot, and the future becomes determined. This allows
any pending suspended procedures waiting for this
value to run, and any further references to the future
will simply return the now computed value.
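
The slot-based future described above can be approximated with a thread and an event. This is only an analogy written for illustration, not Halstead's implementation; the class name and its methods are assumptions, and a computation that raises an exception is not handled.

```python
import threading

class Future:
    """A promise whose slot is filled by a computation running in parallel."""

    def __init__(self, thunk):
        self._determined = threading.Event()
        self._slot = None
        worker = threading.Thread(target=self._run, args=(thunk,), daemon=True)
        worker.start()

    def _run(self, thunk):
        self._slot = thunk()    # store the returned value in the slot
        self._determined.set()  # the future becomes determined

    def value(self):
        self._determined.wait()  # suspend the caller until determined
        return self._slot

# The future can be stored in a data structure before its value is known.
pair = (Future(lambda: 6 * 7), "tag")
```

Only the call to `value()` forces the caller to wait; building `pair` does not.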

The use of future or delay can be thought of as a
declaration. Their use declares that either the computation
done within the scope of the delay or future has no
side effects, or that, if it does, the order in which those
side effects are done relative to other computations is
irrelevant. There is also a guarantee with such a declaration
that no free or shared variable referenced by the
computation is side effected by some other parallel-executing
portion of the program.

Like all declarations, the use of future or delay is a
very strong assertion. In a way similar to type declarations,
their use is difficult to check automatically, is
error prone, has a significant performance impact if
omitted, and may function correctly for many test cases,
but fail unexpectedly on others.

Advocates of strong type checking in compiled languages
have attempted for years to build compilers
capable of proving the type correctness of programs. It
is believed that they have failed. All languages sophisticated
enough for serious programming require at least
some dynamic checks for type safety at execution time.
These checks are implemented as additional instructions
on conventional architectures, or as part of the normal
instruction execution sequence on more recent architectures
as discussed in Moon, "The Architecture of the
Symbolics 3600," The 12th Annual International Symposium
on Computer Architecture pp. 76-83 (1985). One
alternative, declaring the types and relying on the word
of the programmer, while leading to good performance
on conventional architectures, is dangerous, error-prone,
and inappropriate for sophisticated modern programming
environments.

SUMMARY OF THE INVENTION 

The main object of the present invention is to eliminate
the disadvantages of the prior art systems.

The present invention relates to a computer system
having the ability to execute several portions of a program
in parallel, while giving the appearance to the
programmer that the program is being executed sequentially.
This automatic extraction of parallelism makes
the architecture attractive because one can exploit the 
existing programming tools, languages and algorithms 
which have been developed over the past several dec- 
ades of computer science research, while gradually 
endeavoring to improve them for the parallel environ- 
ment. While not all of the benefits of parallel execution 
will be attained with this architecture, the ability to 
extract even modest amounts of parallel execution from 
existing programs may be worth the effort. Moreover, 
the architecture can be extended, to provide some ex- 
plicit and modular programmer control over the paral- 
lelism. 

In achieving the invention, just as hardware has pro- 
vided important runtime support for type checking, it is 
proposed to use hardware in the runtime checking for 
future declarations. 

In a proposed Lisp implementation, explicit language 
extensions such as future and pcall are not used. Instead, 
it is assumed that each evaluation is encapsulated in a 
future and that each call can freely evaluate its arguments
in parallel. The approach is to detect and correct
those cases in which this assumption is unjustified. 

In accordance with the invention a compiler transforms
the program into sequences of compiled primitive
instructions, called transaction blocks. Each transaction
block has a mostly functional portion and terminates in
a side effect. While the mostly functional portion preferably
has no side effecting instructions, the system will
operate effectively as long as at least the majority of the
instructions are functional, that is, they only reference but
do not modify main memory.

Each block is independent of previous execution, 
except for the contents of main memory. In particular, 
no registers or other internal bits in the processor are 
shared across blocks. 

Each block contains exactly one side-effecting store 
into main memory, occurring at the end of the block. 

Each transaction block is, from the standpoint of the
memory system, a mostly functional program consisting
of register loads and register-to-register instructions,
with intermediate side effecting stores, and ending with a
side effecting store into main memory. The results of the
side effecting stores are temporarily stored before
being applied to main memory.

Each of these blocks can be executed independently,
on several different processors. Moreover, there is no
interaction between blocks until the blocks are confirmed,
that is, until the results are validated. However, the
order in which the blocks are confirmed is critical. The
side effects must be executed in a well defined order: the
order specified in the program. Further, the execution
of a side effect can potentially modify a location which
some other block has already read.

These two tasks, confirming in order and enforcing
the dependencies of later blocks on earlier side effects,
require special hardware in accordance with the
invention.

Each processor is free to execute its block at will. The
results of the side effects are stored in a cache associated
with the processor. As a result, this execution cannot
affect other processors. A block counter is maintained,
similar in purpose to a normal program counter. The
block counter's job is to keep track of which block is
next allowed to be confirmed. As the block counter
reaches each block in turn, the block compares its referenced
locations with modified location of already confirmed
blocks. If there is no comparison, this block is
confirmed. Confirming a block allows the results stored
in the cache to be sent to main memory and modify the
contents of main memory. Other blocks, not yet confirmed,
may already have referenced the location which
has just been modified. If they have, then the data upon
which they are computing is now invalid when they are
confirmed.

But the program which they are executing has not
modified main memory since the results are stored in a
cache. Thus the system is free to abandon the unconfirmed
executed block, and to attempt re-executing it a
second time. This abandoning of an unconfirmed executed
block is called aborting the block.

In order to detect when the block needs to be
aborted, one needs to keep track of which memory
locations it has referenced. A dependency list is therefore
maintained, containing the addresses of all locations
in main memory which have been referenced during
execution of a block. When a processor executes a
load instruction, it adds the address being loaded to the
executing block's dependency list.

When a block is confirmed, the address it is side effecting
is broadcast to all other executing processors.
Each processor checks its dependency list to determine
if the block it is executing has referenced the modified
location. If it has, the block is aborted and re-executed.

Eventually, a given block will be reached by the
block counter, and it will be allowed to complete and
perform its side-effect. The block counter advances to
the next block, and if the block is ready, that block is
confirmed in the following cycle. If the confirmation of
one block aborts the execution of successor blocks,
however, the performance will be limited by the time it
takes the aborted block to re-execute its program.

The performance improvement resulting from this
technique depends on two factors: the average block
length, and the frequency of aborts. The longer the
block, the fewer confirms are necessary to execute a
given program. The more frequent the aborts, the more
the block counter must await a sequential computation
before confirming the next block.

One can see now the influence of reducing the number
of side effects in the source program: as the number
of side effects is reduced, the length of the blocks can be
increased. Since the blocks are longer, fewer blocks
need be executed to perform a computation. Since the
block execution is sequential, at a rate limited to a maximum
of one per clock cycle, this limits the performance
of the architecture. There is probably a maximum desirable
block length, since as the block size increases, the
length of the dependency list, and thus the likelihood of
a side effect from another block influencing the computation
goes up.

In accordance with the invention, the parallel processing
system of the invention is receptive of a program
and has at least two processors connected in parallel
to a shared main memory. Each processor executes
instructions of the program which includes side effecting
instructions which modify the contents of a location
in main memory and functional instructions which reference
locations in main memory. Means are provided
for compiling a program into a series of independent
instruction blocks including predominantly functional
instructions and terminating in a side-effecting instruction
and wherein the processors execute the blocks in
parallel.

The system accordingly further comprises for each
processor, means for maintaining a dependency list of
all locations in main memory which have been referenced
during the execution of a block therein, first
cache means associated with each processor for temporarily
storing the location and contents of each location
in main memory to be side effected by the execution of
the block in the associated processor and means are
provided for confirming each block in the order of the
blocks as specified in the program wherein the validity
of the contents of the first cache means are approved.
The confirming means comprise means for comparing
the dependency list of the block to be confirmed with
the locations side effected by already confirmed blocks
to detect a match and means for aborting the execution
of the block to be confirmed if there is a match.

In a preferred embodiment, the confirming means
comprises a block counter and means for incrementing
the block counter to indicate the next block to be confirmed.
The means for maintaining the dependency list
comprises second cache means associated with each
processor for temporarily storing each location and
contents of main memory referenced by a block being
executed in the processor. The confirming means further
comprises means for updating the contents of the
second cache means upon the aborting of a block and
for effecting the reexecution of an aborted block with
the side effected location from an already confirmed
block.

In one embodiment the first and second cache means
each comprises a fully associative content addressable
memory. In the alternative embodiment, the second
cache means comprises an addressable random access
memory, for storing the dependency information of
main memory locations and hash table means receptive
of said addresses of the main memory locations for
hashing them into a reduced number of address bits for
addressing the random access memory.

In accordance with the present invention, the method
for executing a program in at least two processors connected
in parallel to a shared main memory, comprises
compiling the program into a series of independent
instruction blocks including predominantly functional
instructions and terminating in a side-effecting instruction,
applying the blocks to the processors and executing
the blocks in parallel.

The step of executing comprises maintaining a dependency
list for each processor of all locations in main
memory which have been referenced during the execution
of a block therein, for each processor temporarily
storing the location and contents of each location in
main memory to be side effected by the execution of the
block therein and confirming each block in the order of
the blocks as specified in the program to approve the
validity of the temporarily stored contents by comparing
the dependency list of the block to be confirmed
with the locations which have been side effected by the
already confirmed blocks to detect a match.
When there is a match the execution of the block is
aborted.

The step of maintaining a dependency list preferably
comprises temporarily storing, for each processor, each
location and contents of main memory referenced by
the block being executed therein and wherein the step
of executing further comprises reexecuting an aborted
block with the side effected location from an already
confirmed block.

In another embodiment, the step of temporarily storing
the referenced locations comprises hashing each
location address into a hashed address having a reduced
number of address bits and storing a logic one at the
hashed address in a random access memory.

Moreover, the method includes selectively omitting 
referenced locations from the dependency list when 
desired. 

These and other features of the invention will be
described in the following description with reference to
the drawings, wherein:

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a block diagram of a system according to
the invention;

FIG. 2 shows an example of a transaction block according
to the invention;

FIG. 3 shows an example of several blocks executing
in parallel;

FIG. 4 is a call tree for guiding the block starting
process according to the invention;

FIG. 5 is a state transition diagram of the state transition
logic for the dependency cache; and

FIG. 6 is an alternate embodiment for the dependency
cache shown in FIG. 1.

DETAILED DESCRIPTION OF THE INVENTION

Referring now to FIG. 1, the system in accordance
with the present invention includes processors 1a, 1b, 1c
connected in parallel to common bus 5 to share main
memories 6 and 7. It should be noted that while three
processors are shown in parallel for the purpose of
example, the system according to the present invention
can utilize any number of processors in parallel, from
two upward, in accordance with the method and
system of the present invention. It should also be noted
that a quasi parallel processor, of the type wherein a single
processor executes two or more programs in an
interleaved manner, is also contemplated as being within
the scope of the present invention.

Connected between each processor and the bus 5 are
a confirm cache 2a, 2b, 2c and a dependency cache 3a,
3b, 3c which has associated state transition logic 4a, 4b,
4c connected thereto. Processors 1a, 1b and 1c also have
connections to a block counter 8, which will be described
hereinafter.

The system also includes compiler 9 connected to the
bus in a conventional manner for transforming an input
program into sequences of compiled primitive instructions
called transaction or instruction blocks.

Each block, as shown in FIG. 2, has a mostly functional
portion and terminates in a side effecting store
instruction. As noted, the functional instructions include
those which reference main memory to carry out
the operations thereof or operate only on internal processor
registers, while the side effecting instructions are
those which require modification of the contents of a
location in main memory.

The mostly functional portion can have no side ef- 
fecting instructions or one or more side effecting in- 
structions, as long as at least a majority of the instruc- 
tions thereof are functional. 

In accordance with the method of the present invention,
after the compiler 9 has broken up the program
into the blocks, the blocks are applied to the processors
1a-1c for execution in parallel. Each reference to main
memory is stored in the dependency cache 3a-3c
whereas each side effecting instruction is carried out
and written only into the confirm cache 2a-2c.

Thus each processor is free to execute its block at will 
since all referenced locations and all locations to be 
modified are stored in the caches 2a-2c and 3a-3c with- 
out the contents of main memory 6, 7 being affected. 

In order to validate the data for being written into
main memory to modify the contents thereof, each
block must first be confirmed. The blocks are confirmed
in the order specified in the overall program and this is
carried out by the block counter 8 which specifies the
order in which the blocks executing on the processors
1a-1c will be confirmed.

The dependency caches 3a-3c, by keeping track of all
memory locations referenced by the processor during
the execution of a block, thereby maintain a dependency
list of the data which is depended on by the block
in order to achieve the execution thereof. After the first
block is confirmed, all of the side effected locations
stored in the confirm cache corresponding thereto are
placed out on the bus and received by the dependency
caches of the other blocks to see if any of these side
effected locations were depended upon by the blocks
during execution. If this is the case and there is a match,
then the executed unconfirmed block must be aborted
and reexecuted with the updated side effected location
data.
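
By way of illustration only, the in-order confirmation and abort behavior described above may be sketched in software as follows. This is a minimal sketch, not the hardware of FIG. 1: the names Block, execute and run are hypothetical, the dependency cache is modeled as a set of addresses, the confirm cache as a dictionary, and the block counter as a loop index. A match is detected on the address alone.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Block:
    """One transaction block (hypothetical software model)."""
    body: Callable                               # body(load, store) does the work
    deps: set = field(default_factory=set)       # dependency list: addresses read
    writes: dict = field(default_factory=dict)   # confirm cache: address -> value
    runs: int = 0                                # number of executions so far

    def execute(self, memory):
        """Run the block speculatively; main memory is never modified here."""
        self.deps.clear()
        self.writes.clear()
        self.runs += 1

        def load(addr):
            if addr in self.writes:              # own tentative writes take priority
                return self.writes[addr]
            self.deps.add(addr)                  # record the dependency
            return memory[addr]

        def store(addr, value):
            self.writes[addr] = value            # buffered until confirmation

        self.body(load, store)


def run(blocks, memory):
    """Execute all blocks speculatively, then confirm them in program order."""
    for block in blocks:
        block.execute(memory)
    for counter, block in enumerate(blocks):     # the block counter
        memory.update(block.writes)              # confirm: write back to memory
        for later in blocks[counter + 1:]:
            if later.deps.intersection(block.writes):
                later.execute(memory)            # abort and re-execute
    return memory


memory = {"A": 1, "B": 2, "C": 0, "D": 0}
b1 = Block(lambda load, store: store("A", load("B") + 1))
b2 = Block(lambda load, store: store("C", load("B") * 2))
b3 = Block(lambda load, store: store("D", load("A") + 10))
run([b1, b2, b3], memory)
```

In this sketch the third block reads A before the first block's store to A is confirmed, so it is executed twice, while the other two blocks are executed once each.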

In a preferred embodiment of the present invention, 
the confirm and dependency caches are fully associative 
content addressable memories which enable the com- 
parison to be made without any additional structure. 

FIG. 3 shows an example of several blocks executing
in parallel. As can be seen from this example, block 3
will abort when the first block is confirmed since block
3 references the variable A which has been side effected
in block 1.

Since the system is simulating the behavior of a uni- 
processor by using parallel hardware, there is a defined 
order in which the blocks must be confirmed. Execution 
of a given block should start early enough so that its 
results are ready by the time that block is reached by the 
block counter 8. 

FIG. 4 shows a call tree of a process which provides 
heuristic data to guide the block starting process. Here, 
it is desirable to start execution of blocks 2 and 3, while 
the block counter is at block 1, because these are the 
next sequential blocks. One would also want to start 
execution of block 6, because it is unlikely that confirm- 
ing blocks 1-5 will side effect data which block 6 de- 
pends upon. 

But starting execution of a block earlier is potentially 
bad, because it will reference data which is older than 
necessary and because the processor being used to exe- 
cute it may be needed to execute a block which must be 
confirmed sooner. 

These observations lead to the following block execution
strategy:

1. Start executing blocks ahead of this block,
downward and to the left in the call tree.

2. Start executing blocks which branch downward to
the right in the call tree.

3. If additional processors are required for (1) or
(2), abort execution of the furthest right branches of
the call tree, and reclaim their processors.

Heuristic (1) is very similar to instruction pre-fetching
on conventional pipelined architectures. Since we
are executing this block, we will next need to execute its
successor.

Heuristic (2) depends on the fact that evaluations
from the same level of the call tree
are often independent, and can often be executed in
parallel without conflicting side effects.

Heuristic (3) is a simple consequence of the fact that
we must finish execution of the bottom left portion of
the call tree prior to executing the right portion of the
call tree. Thus, we are better off using the processors
executing blocks to the right for more immediate needs.

These heuristics give a very good handle on the difficult
problem of resource allocation in the multiprocessor
environment.

The presence of conditionals in the language, however,
makes the tidy resource allocation scheme described
above much more difficult. Since one does not
know which of the two paths a conditional will follow
until the data computation is confirmed, one must start
blocks executing down both branches, or risk delaying
execution waiting for a block on the path not followed.
Fortunately, one is free to start executing down both
paths, since until a block is confirmed it does not side
effect main memory. The blocks started along the unfollowed
branch of a conditional are simply aborted without
confirming.

Allocation of processor resources in the presence of
conditionals is substantially more difficult, since one
cannot, except heuristically, predict which of two paths
is more worthwhile for expending processor resources.

Loops are a special case of conditional execution. As
with most parallel processors, there is an advantage in
knowing the high level semantics of a loop construct.
Rather than attempting to execute repetitive copies of
the same block, incrementing the loop index as a setup
for the next pass, it is far preferable to know that a loop
is required, and that the loop is, for example, indexing
over a set of array values from 0 to 100. One can then
execute 101 copies of the loop block, each with a unique
value of the loop index. The unpalatable alternative is to
rely on the side effect detection hardware to notice that
the loop index is changed in each block iteration, resulting
in an abort of the next block on every loop iteration.
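
As a rough software analogy for the loop case, and only as a sketch, each copy of the loop block may be started with its own value of the loop index rather than waiting for the preceding copy to increment it. The function loop_body, the array values and the use of a thread pool are assumptions made for illustration and do not appear in the specification.

```python
from concurrent.futures import ThreadPoolExecutor


def loop_body(index, values):
    """One copy of the loop block; it reads only its own index value."""
    return values[index] * 2


values = list(range(101))            # array indexed from 0 to 100

with ThreadPoolExecutor(max_workers=4) as pool:
    # Start 101 copies of the body, each with a unique index value,
    # instead of serializing on a shared, incremented loop index.
    results = list(pool.map(lambda i: loop_body(i, values), range(0, 101)))
```

Because no copy depends on an index written by another copy, none of them would be aborted by the side effect detection described above.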

Unlike most of the parallel proposals for loop execution,
however, the systems herein can also handle the
hard case where the execution of the loop body modifies
the loop control parameters, perhaps in a very indirect
or conditional way.

The way the system deals with both conditional execution
and with loop iteration involves prediction of the
future value of a variable. In a sense, the dependency list
maintenance already described is a form of value prediction:
one predicts that the value one is depending
upon will not change by the time the executing block
confirms.

One can extend this idea to one which predicts the
future value of variables likely to be modified from their
current value prior to executing the block. One application
of this technique is to the problem of loop iterations.
Instead of relying on the current value of the loop
index as the predicted value for the next iteration, one
desires to predict the future value, predicting a different
future value for each parallel body block started.

Similarly, one can implement conditionals by predicting
the value of the predicate result slot, predicting that
it is true in one arm of the conditional, and that it is false
in the other. The predicate is then evaluated, and when
it confirms, it side effects the predicate result slot, aborting
one of the two branches. When aborting the execution
of conditional blocks, we must abandon execution
of the block rather than update the variable and retry, as
we would in normal block execution.

The shared memory multiprocessor of the present
invention, with each processor 1a-1c containing two
fully associative caches, effectively implements the
above scheme. The first of the caches, the dependency
cache 3a-3c, usually holds read data copies from main
memory 6, 7. This cache also watches bus transactions in
a way similar to the snoopy cache coherency protocol
as described in Goodman, "Using Cache Memory to
Reduce Processor Memory Traffic" The 10th Annual
International Symposium on Computer Architecture
pp. 123-131 (1983). The second cache, the confirm
cache 2a-2c, holds only side effected data written by the
associated processor, but not yet confirmed.

Since the processors share a common memory bus 5, 
each cache can see all writes to main memory. The 
dependency cache acts as a normal data read cache, but 
also implements the dependency list and predicted 
value features. The confirm cache is used to allow pro- 
cessors to locally side effect their version of main mem- 
ory, without those changes being visible to other pro- 
cessors prior to their block confirmation. 

Each main memory location, therefore, can have two
entries in a processor's caches, one in the dependency
cache, and one in the confirm cache. This is because
knowledge must be maintained, in the dependency cache,
of what data the processor's current computation has
depended upon, while also allowing the processor to
tentatively update its version of memory, in the confirm
cache. For processor reads, priority is always given to
the contents of the confirm cache if there is an entry in
both caches, because one wants the processor to see its
own modifications to memory, prior to them being
confirmed.
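
The read priority just described can be sketched as a small lookup routine. This is an illustrative model only: processor_read is a hypothetical name, and both caches and main memory are modeled as Python dictionaries keyed by address.

```python
def processor_read(addr, confirm_cache, dependency_cache, main_memory):
    """Return the value the processor sees for addr.

    The confirm cache has priority, so a block observes its own
    unconfirmed side effects; otherwise the dependency cache serves
    the read, filling from main memory on a miss.
    """
    if addr in confirm_cache:
        return confirm_cache[addr]
    if addr not in dependency_cache:
        dependency_cache[addr] = main_memory[addr]   # read miss: fetch a copy
    return dependency_cache[addr]


main_memory = {0x10: 5}
confirm_cache, dependency_cache = {}, {}

before = processor_read(0x10, confirm_cache, dependency_cache, main_memory)
confirm_cache[0x10] = 9          # tentative side effect, not yet confirmed
after = processor_read(0x10, confirm_cache, dependency_cache, main_memory)
```

Here the second read returns the tentative value while main memory still holds the original contents.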

The dependency cache performs three functions:

1. It acts as a normal read data cache for main mem- 
ory data. 

2. It stores block dependency information. 

3. It holds predicted values of variables associated 
with this block. 

FIG. 5 shows a state transition diagram for the dependency
cache as carried out by the state transition logic
4a, 4b, 4c. There are six states shown in the cache state
diagram: INVALID, VALID, DEPENDS, PREDICT/VALID,
PREDICT/INVALID and PREDICT/ABANDON.

INVALID is the state associated with an empty 
cache line. 

VALID is the state representing that the cache holds
a correct copy of a main memory datum.

DEPENDS indicates that the cache holds correct
data, and that the ongoing block computation is depending
on the continued correctness of this value.

PREDICT/VALID indicates that the block is pre- 
dicting that, when it is ready to confirm, the value held 
in this cache location will continue to be equal to the 
value in main memory. This state indicates that the held 
value is now equal to main memory. This state differs 
from DEPENDS only in the action taken when a bus 
cycle modifies this memory location. 

PREDICT/INVALID indicates that the block is
predicting that, when it is ready to confirm, the value
held in this cache location will be equal to the value in
main memory. The contents of main memory currently
differ from the value held in the cache.

PREDICT/ABANDON indicates that the block
execution is conditional on the eventual contents of
memory being side effected to be equal to that held in
the cache. If it is side effected to some other value, the
block will be aborted and not restarted (abandoned).

At the start of each block execution, the dependency
cache is initialized. This results in setting all locations to
either the INVALID or VALID states. Cache entries
which are currently VALID, DEPENDS, or PREDICT/VALID
are forced to the VALID state. All
others are forced to the INVALID state.

Each time the processor loads a data item into a register
from the depends cache, the state of the cache line is
potentially modified. If the cache entry for the location
is in the INVALID state, a memory bus read request is
performed, a new cache line allocated, the data stored
in the depends cache, and the line state is set to DEPENDS.
If the cache line is already VALID, then the
state is simply set to DEPENDS. In all other states, the
data is provided from the cache, and the state of the
entry is unmodified.
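
The block-start initialization and the load transitions described above can be modeled as follows. This is a hedged sketch rather than the state transition logic of FIG. 5 itself: the class DependencyCache, its methods start_block and load, and the representation of a cache line as a two-element list are all assumptions made for illustration.

```python
from enum import Enum, auto


class State(Enum):
    INVALID = auto()
    VALID = auto()
    DEPENDS = auto()
    PREDICT_VALID = auto()
    PREDICT_INVALID = auto()
    PREDICT_ABANDON = auto()


class DependencyCache:
    """Software model of one processor's dependency cache (hypothetical)."""

    def __init__(self, main_memory):
        self.memory = main_memory
        self.lines = {}                  # address -> [state, value]

    def start_block(self):
        """Initialize at block start: every line becomes VALID or INVALID."""
        survivors = {State.VALID, State.DEPENDS, State.PREDICT_VALID}
        for line in self.lines.values():
            line[0] = State.VALID if line[0] in survivors else State.INVALID

    def load(self, addr):
        """Processor load: record the dependency and return the data."""
        line = self.lines.get(addr)
        if line is None or line[0] is State.INVALID:
            # Miss: read main memory, allocate a line, mark it DEPENDS.
            line = self.lines[addr] = [State.DEPENDS, self.memory[addr]]
        elif line[0] is State.VALID:
            line[0] = State.DEPENDS
        # In all other states the line is left unmodified.
        return line[1]
```

A line in the DEPENDS state thus returns to VALID at the next block start, so its data can be reused without a further bus read.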

A cache miss is allowed to replace any entry in the
cache which is in the INVALID or VALID states, but
must not modify other entries. We require this because
the dependencies are being maintained by these states.
This implies that the cache must be fully associative,
since otherwise a conflict over the limited number of
sets available for a given location would make it impossible
to store the dependency information.

A Predict-Value request from the processor stores a
predicted future value for a memory location into the
dependency cache. The cache initiates a main memory
read cycle for the same location. If the location contains
the predicted value, the cache state is set to PREDICT/VALID.
If the values disagree, the cache state is
set to PREDICT/INVALID.

A Predict-Abandon request from the processor stores
a predicted future value of a predicate into the depends
cache, and forces the cache line to the PREDICT/ABANDON
state.

A bus write cycle, taken by some other processor,
can potentially modify a location in main memory
which is held in the depends cache. Since there is a
shared memory bus connecting all of the caches, each
cache can monitor all of these writes. For INVALID
cache entries, nothing is done. For VALID entries, the
newly written data is copied into the cache, and the
cache maintains validity.

For DEPENDS entries, the cache is updated, and if
the new contents differ from the old contents, the running
block on the cache's associated processor is
aborted and then restarted. This is the key mechanism
for enforcing inter-block sequential consistency.

Bus writes to locations which are PREDICT/VALID
or PREDICT/INVALID are compared with
the contents of the depends cache. If the values agree,
the cache state is set to PREDICT/VALID. If they
disagree, the state is set to PREDICT/INVALID.

Bus writes to locations which are PREDICT/ABANDON
are compared with the contents of the depends
cache. If they agree, no change occurs. If they
disagree, the associated processor is aborted, and the
block currently executing is permanently abandoned.
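
The prediction requests and the response of a dependency cache line to a bus write by another processor can be summarized in one routine. The following is a sketch under stated assumptions: predict_value, predict_abandon and snoop_write are hypothetical names, a line is modeled as a two-element list of state and held value, and the returned strings "continue", "restart" and "abandon" stand in for the signals sent to the associated processor.

```python
from enum import Enum, auto


class State(Enum):
    INVALID = auto()
    VALID = auto()
    DEPENDS = auto()
    PREDICT_VALID = auto()
    PREDICT_INVALID = auto()
    PREDICT_ABANDON = auto()


def predict_value(predicted, memory_value):
    """Predict-Value request: compare the prediction with main memory."""
    agrees = predicted == memory_value
    return [State.PREDICT_VALID if agrees else State.PREDICT_INVALID, predicted]


def predict_abandon(predicted):
    """Predict-Abandon request: force the line to PREDICT/ABANDON."""
    return [State.PREDICT_ABANDON, predicted]


def snoop_write(line, new_value):
    """Apply a bus write seen for this line; return the processor action."""
    state, held = line
    if state is State.INVALID:
        return "continue"                     # nothing is done
    if state is State.VALID:
        line[1] = new_value                   # copy data, stay valid
        return "continue"
    if state is State.DEPENDS:
        line[1] = new_value                   # update the cache
        return "restart" if new_value != held else "continue"
    if state in (State.PREDICT_VALID, State.PREDICT_INVALID):
        agrees = new_value == held            # held value is the prediction
        line[0] = State.PREDICT_VALID if agrees else State.PREDICT_INVALID
        return "continue"
    # PREDICT/ABANDON: a disagreeing write abandons the block for good.
    return "continue" if new_value == held else "abandon"
```

Note that in the prediction states the held value is never overwritten by the bus write, since it represents the prediction against which later writes are compared.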

The confirm cache preferably comprises a fully associative
cache which holds only side-effected data.
When the block is initialized, the confirm cache is emptied
by invalidating all its entries. When the processor
performs a side-effecting write, the write data is held in
the confirm cache, but not written immediately into
main memory. There, it is visible to the processor which
performed the side-effect, but to no other processors.
The confirm cache has priority over the dependency
cache in providing read data to the processor. If both
hold the contents of a location, the data is provided by
the confirm cache. This allows a processor to modify a
location, and to have those modifications visible during
further computation within the block.

When the block counter 8 reaches a block associated
with a particular processor, and the block has completed
execution, it is time to confirm that block. One
final memory consistency check is performed. The load
dependencies are necessarily satisfied, because if they
were not, then the block would have been aborted. But
the predicted-value dependencies may not be satisfied.
These dependencies are checked by testing to see if
there are any entries in the dependency cache which are
in the PREDICT/INVALID state. Any entry in that
state indicates a memory location whose contents were
predicted to be one value, and whose actual memory
contents are another. This block must be re-executed
with the now-current value.

After performing this final consistency check, the 
side effects associated with this block may be per- 
formed. This operation consists of sweeping the con- 
firm cache, and writing back to memory any modified 
entries. The write-back of these entries may, of course, 
force other processors to abort partially or completely 
executed blocks, if they have depended on the old val- 
ues of these locations. 
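
The final consistency check and the write-back performed at confirmation can be sketched as a single routine. This is illustrative only: confirm_block is a hypothetical name, the dependency cache is reduced to a dictionary of per-address state names, and the bus broadcast that may abort other processors is not modeled.

```python
def confirm_block(dependency_states, confirm_cache, main_memory):
    """Try to confirm a completed block.

    dependency_states maps address -> state name for the block's
    dependency cache; confirm_cache maps address -> tentative value.
    Returns True if the side effects were applied, or False if a
    failed value prediction means the block must be re-executed.
    """
    if "PREDICT/INVALID" in dependency_states.values():
        return False                          # prediction not satisfied
    for address, value in confirm_cache.items():
        main_memory[address] = value          # sweep: write back each entry
    confirm_cache.clear()
    return True


main_memory = {"x": 0, "i": 3}
ok = confirm_block({"i": "PREDICT/VALID"}, {"x": 42}, main_memory)
retry = confirm_block({"i": "PREDICT/INVALID"}, {"x": 99}, main_memory)
```

In the second call the tentative value is never written, because the block's prediction for i did not hold at confirmation time.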

One consequence of using this cache strategy for
implementing the dependency checking is that each
block can perform multiple side-effects, and those side-effects
will be visible only to the executing block until
the block is confirmed. This enables each block to have
more than just the terminal side-effect and allows the
compiler flexibility in choosing the optimal block size
independent of how many side-effects are present
within the block.

Another important feature of using the cache as a 
dependency tracking mechanism is that when a transac- 
tion aborts, the valid entries in the cache remain, so that 
the re-execution of the block will likely achieve an 
almost 100% cache hit rate, reducing memory bus traf- 
fic and improving processor speed. 

The Lisp operation cons, although it performs a write
into main memory, is not a side effecting instruction,
since it is guaranteed that there is no other pointer to the
location being written, and thus no other block can be
affected by its change. In the block execution architecture,
this can be implemented by providing each processor
with an independent free pointer. Consing and other
local memory allocating writes done within a block are
performed using the free pointer. The value of the
pointer is saved when the block is entered, and an abort
resets the pointer to its entry value, automatically reclaiming
the allocated storage. During the confirm, the
free pointer's value is updated. All of these free pointer
manipulations happen automatically if it is treated simply
as another variable whose value can be loaded and
side effected. Writes of data into newly consed locations
need not be confirmed. They can be best handled with
a write-through technique in the dependency cache. It
is important that the dependency cache not contain stale
copies of data which has been written with a cons write
operation.
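
The free pointer handling for cons can be sketched as follows. This is a minimal model under assumptions: BlockHeap, enter_block, cons and abort are hypothetical names, one heap region is assumed per processor, and cells are stored as Python tuples rather than machine words.

```python
class BlockHeap:
    """Per-processor allocation region with a free pointer (hypothetical)."""

    def __init__(self, size):
        self.cells = [None] * size
        self.free = 0                  # next unallocated cell
        self.entry_free = 0            # free pointer saved at block entry

    def enter_block(self):
        self.entry_free = self.free    # remember where this block started

    def cons(self, car, cdr):
        """Allocate a fresh cell; no other block can hold a pointer to it."""
        address = self.free
        self.cells[address] = (car, cdr)
        self.free += 1
        return address

    def abort(self):
        """Reset the free pointer, reclaiming everything the block consed."""
        self.free = self.entry_free


heap = BlockHeap(8)
heap.enter_block()
first = heap.cons(1, None)
heap.abort()                           # block aborted: storage reclaimed
heap.enter_block()
second = heap.cons(2, None)            # reuses the reclaimed cell
```

After the abort, the re-executed block allocates the same cell again, so no storage is leaked by the abandoned execution.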

The addition of one primitive to the processor instruction
set allows the re-introduction of explicit programmer
visible parallelism into the language supported by
this architecture. The primitive performs a load operation
without adding the location being loaded to the
dependency list. In the depends cache, it returns Valid
data if present, and reads memory if necessary to produce
valid data. It never changes the state of the cache
to Depends.

With this primitive, it is possible to express algorithms
which compute results based on potentially stale
data. In those cases where this is acceptable from an
algorithmic point of view, then parallel execution with
no inter-block dependencies can be supported. One
example is the Gauss-Jacobi parallel iterative matrix
solution technique, as contrasted with the sequential
Gauss-Seidel technique.
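
The contrast between the two techniques can be made concrete with a short sketch. The function names jacobi_sweep and gauss_seidel_sweep and the example system are assumptions chosen for illustration. In the Jacobi sweep every component is computed from the previous iterate only, so each component could be computed by an independent block reading possibly stale data; in the Gauss-Seidel sweep each component depends on components already updated in the same sweep.

```python
def jacobi_sweep(A, b, x):
    """One Gauss-Jacobi sweep: every x[i] uses only the old iterate x."""
    n = len(b)
    return [
        (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]


def gauss_seidel_sweep(A, b, x):
    """One Gauss-Seidel sweep: x[i] uses values updated earlier in the sweep."""
    x = list(x)
    n = len(b)
    for i in range(n):
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
    return x


# Diagonally dominant example system with solution x = [1, 2].
A = [[4.0, 1.0], [1.0, 3.0]]
b = [6.0, 7.0]

x_jacobi = x_seidel = [0.0, 0.0]
for _ in range(60):
    x_jacobi = jacobi_sweep(A, b, x_jacobi)
    x_seidel = gauss_seidel_sweep(A, b, x_seidel)
```

Both iterations converge to the same solution for this system; they differ in the intermediate iterates and in how much of each sweep can proceed in parallel.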

In an alternative hardware implementation shown in
FIG. 6, dependencies are stored using a hashing technique
rather than a fully associative cache. This implementation
may be attractive because the RAM used is
less expensive than the fully associative storage used in
the cache based design.

The circuit includes processors 11a, 11b, 11c con- 
nected through the confirm cache 12a, 12b, 12c to the 
shared system bus 16 with memories 15a, 15b. The cir- 
cuit also includes a hash function box 13a, 13b, 13c 
which receives the main memory address, which is for 
example thirty-two bits, and hashes it into a reproducible 
hashed address of for example twelve bits. The twelve bit 
address is used to address a 4096 x 1 RAM 14a, 14b, 14c. 
At the start of an instruction block execution in proces- 
sor 11a, the Clear signal is asserted, and the RAM 14a is 
cleared, either by writing each location to a logic zero, 
or through special means for simultaneously clearing all 
locations. Each time processor 11a performs a memory 
load operation it asserts the write signal, and writes a 
logic one into RAM 14a at the location addressed by 
the hash function. Each time processor 11b terminates a 
block by performing its last side effect, the memory 
store from processor 11b to memory 15a or 15b is visible 
to processor 11a. Hash box 13a calculates the hashed 
address, and examines the bit in RAM 14a correspond- 
ing to the hashed address being written by processor 
11b. If this bit is a logic one, then the block just con- 
firmed on processor 11b has potentially (but not neces- 
sarily) invalidated the computation being performed by 
processor 11a. Processor 11a must abort, clear RAM 
14a, and retry executing the block. 
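
The hashed dependency store described above can be modelled with a short Python sketch. This is a software illustration under stated assumptions, not the circuit of FIG. 6: the class name, the method names and in particular the XOR-folding hash are hypothetical, since the text only requires that the hash be reproducible and yield a twelve bit address.

```python
class HashedDependencyRam:
    """Software model of a 4096 x 1 dependency RAM addressed by a hash."""

    def __init__(self, bits=12):
        self.bits = bits
        self.ram = [0] * (1 << bits)       # 4096 one-bit locations

    def _hash(self, address):
        # Hypothetical hash: XOR-fold a 32-bit address into `bits` bits.
        mask = (1 << self.bits) - 1
        address &= 0xFFFFFFFF
        hashed = 0
        while address:
            hashed ^= address & mask
            address >>= self.bits
        return hashed

    def clear(self):
        """Clear signal: asserted at the start of a block execution."""
        self.ram = [0] * (1 << self.bits)

    def record_load(self, address):
        """On every memory load, write a logic one at the hashed address."""
        self.ram[self._hash(address)] = 1

    def must_abort(self, stored_address):
        """Check a store made visible by another processor's confirmed block.
        A set bit means the running block may depend on that location."""
        return self.ram[self._hash(stored_address)] == 1


deps = HashedDependencyRam()
deps.record_load(0x1000)
true_hit = deps.must_abort(0x1000)     # the location really was loaded
false_hit = deps.must_abort(0x0001)    # different address, same hash here
no_hit = deps.must_abort(0x0002)
```

The second check shows why an abort is only potentially necessary: two distinct addresses can hash to the same bit, so the scheme may abort a block that did not actually read the modified location, but it never misses a real dependency.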

What is claimed is: 

1. In a parallel processing system receptive of a pro- 
gram and having at least two processors connected in 
parallel to a shared main memory, wherein each proces- 
sor executes instructions of a program which includes 
side effecting instructions which modify the contents of 
a location in main memory and functional instructions 
which reference locations in main memory or operate 
only on internal processor registers, the improvement 
comprising: means for compiling a program into a series 
of independent instruction blocks including predomi- 
nantly functional instructions and terminating in a side 
effecting instruction; means for applying compiled 
blocks to the at least two processors for executing the 
blocks in parallel; and means for validating data to be 
stored in a location of main memory from each parallel 
processed block relative to locations referenced during 
execution by the other of the at least two processors, 
wherein the validating means comprises, for each pro- 
cessor, means for maintaining a dependency list of all 
locations in main memory which have been referenced 
during the execution of a block therein. 



2. The system according to claim 1, wherein the vali- 
dating means comprises first cache means associated 
with each processor for temporarily storing the location 
and contents of each location in main memory to be side 
effected by the execution of the block therein and means 
for confirming each block in the order of the blocks as 
specified in the program wherein the validity of the 
contents of the first cache means are approved compris- 
ing means for comparing the dependency list of the 
block to be confirmed with the locations side effected 
by already confirmed blocks to detect a match and 
means for aborting the execution of the block to be 
confirmed if there is a match. 

3. The system according to claim 2, wherein the con- 
firming means comprises a block counter and means for 
incrementing the block counter to indicate the next 
block to be confirmed. 

4. The system according to claim 2, wherein the 
means for maintaining a dependency list comprises sec- 
ond cache means associated with each processor for 
temporarily storing each location and contents of main 
memory referenced by a block being executed in the 
processor and wherein the confirming means comprises 
means for updating the contents of the second cache 
means upon the aborting of a block and for effecting the 
reexecution of an aborted block with the side effected 
location from an already confirmed block. 

5. The system according to claim 4, wherein the first 
and second cache means each comprises a fully associa- 
tive content addressable memory. 

6. The system according to claim 4, wherein the sec- 
ond cache means comprises an addressable random 
access memory for storing dependency information of 
main memory locations and hash table means receptive 
of said addresses of the main memory locations for 
hashing them into a reduced number of address bits for 
addressing the random access memory. 

7. The system according to claim 1, wherein the 
means for maintaining a dependency list includes means 
for selectively referencing locations in main memory 
without adding same to the dependency list. 

8. In a method for executing a program in at least two 
processors connected in parallel to a shared main mem- 
ory, wherein the program includes side effecting in- 
structions which modify the contents of a location in 
main memory and functional instructions which refer- 
ence locations in main memory, the improvement com- 
prising: compiling the program into a series of indepen- 
dent instruction blocks including predominantly func- 
tional instructions and terminating in a side effecting 
instruction, applying compiled blocks to the processors, 
executing the blocks in parallel and validating data to be 
stored in a location of main memory from each parallel 
processed block relative to locations referenced during 
execution by the other of the at least two processors by 
maintaining a dependency list for each processor of all 
locations in main memory which have been referenced 
during the execution of a block therein. 

9. The method according to claim 8, wherein the step 
of validating further comprises for each processor tem- 
porarily storing the location and contents of each loca- 
tion in main memory to be side effected by the execu- 
tion of the block therein and confirming each block in 
the order of the blocks as specified in the program to 
approve the validity of the stored contents by compar- 
ing the dependency list of the block to be confirmed 
with the locations which have been side effected by the 
already confirmed blocks to detect a match. 






U.S. Patent    Apr. 25, 1989    Sheet 1 of 4    4,825,360

[FIG. 1: block diagram of the system. Legible labels include PROCESSOR, CACHE, BLOCK, COUNTER, BUS, MAIN MEMORY and STATE TRANS. LOGIC. The remaining drawing content on this sheet is illegible.]

U.S. Patent    Apr. 25, 1989    Sheet 2 of 4    4,825,360

[FIG. 2: example of a transaction block. Partly legible: LOAD R0, C; LOAD R1, (operand illegible); ADD R0, R1, R1, labelled MOSTLY FUNCTIONAL; followed by STORE R1, A.]

[FIG. 3: example of several blocks executing in parallel. Partly legible: a BLOCK POINTER; a block with LOAD R0, C; LOAD R1, B; ADD R0, R1, R1; STORE R1, A; a BLOCK 2; and a block with LOAD R0, A; LOAD R1, B; ADD R0, R1, R1; STORE R1, A; caption fragment EXECUTING ON PROCESSOR.]

SYSTEM AND METHOD FOR PARALLEL PROCESSING WITH MOSTLY FUNCTIONAL LANGUAGES

BACKGROUND OF THE INVENTION

The present invention relates to a computer system and a method for
parallel processing program instructions.

Many parallel architectures start by defining the programming model they
support as strictly functional. Computer architects find that the strictly
functional approach makes their task easier, because the order in which
computation is performed, once the basic dependencies are satisfied, is
irrelevant. Building highly parallel machines which execute strictly
functional languages then becomes relatively straightforward. Strictly
functional languages certainly have advantages, including the relative ease
of proving their correctness and the ease of thinking and reasoning about
certain simple kinds of programs, as is disclosed in Backus, "Can
Programming be Liberated from the von Neumann Style? A Functional Style and
its Algebra of Programs" Communications of the ACM Vol. 21 no. 8 pp.
613-641 (August 1978).

Unfortunately, there is a class of algorithms for which the strictly
functional programming style appears to be inadequate. The problems are
rooted in the difficulty in interfacing a strictly functional program to a
non-functional outside environment. It is impossible to program a task such
as an operating system, for instance, in the unmodified strictly functional
style. Clever modifications of the strict functional style have been
proposed which cure this defect. For example, McCarthy's AMB operator,
which returns whichever of two inputs is ready first, appears to be
adequate to solve this and similar problems. Several equivalent operators
have been proposed in Peter Henderson, "Is It Reasonable to Implement a
Complete Programming System in a Purely Functional Style?" Technical memo
PMM/94, The University of Newcastle upon Tyne Computing Laboratory
(December 1980).

However, the introduction of such non-functional operators into the
programming language destroys many of its advantages. Architects can no
longer ignore the order of execution of functions. Proving program
correctness again becomes a very much more difficult task. Users also find
that programming with such operators is tedious and error prone. In
particular, as shown by Agha in "Actors: A Model of Concurrent Computation
in Distributed Systems" M.I.T. Artificial Intelligence Laboratory Technical
Memo 844, (1985), the presence of such an operator allows the definition of
a new data type, the cell, which can be side effected just as normal memory
locations in non-functional languages. Thus the introduction of the AMB
operator, while adequate to solve the lack of side effect problem,
re-introduces many of the problems which the functional language advocates
are attempting to avoid.

Therefore conventional thought indicated the choice of either defining weak
systems, such as purely functional languages, about which one can prove
theorems, or defining strong systems which one can prove little about.

Much more serious than any of the theoretical difficulties of strictly
functional languages, is the difficulty programmers face in implementing
certain types of programs. It is often more natural to think about an
algorithm in terms of the internal state of a set of objects as, for
example, discussed in chapter 3 of Abelson and Sussman, Structure and
Interpretation of Computer Programs, M.I.T. Press, Cambridge, (1985). This
object oriented viewpoint in programming language design is incompatible
with the strictly functional approach to computation. The same algorithms
can be implemented in either style, and both systems are surely universal,
but the fact that programmers find one representation for programs easier
to think about, easier to design, and easier to debug is, in itself, a
powerful motivation for a programming language to provide that
representation.

If this were merely a representation issue, there would be hope that
suitable compiler technology could eventually lead to a programming
language which provides support for side effects, but which compiles into
purely functional operations. As has been discussed above, however, in the
general case this is not possible. Side effects cause problems. Architects
find it difficult to build parallel architectures for supporting them.
Verification software finds such programs difficult to reason about. And
programming styles with an abundance of side effects are difficult to
understand, modify and maintain.

In accordance with the present invention, instead of completely eliminating
side effects, it is proposed that their use be severely curtailed. Most
side effects in conventional programming languages like Fortran are
gratuitous. They are not solving difficult multi-tasking communications
problems. Nor are they improving the large scale modularity or clarity of
the program. Instead, they are used for trivial purposes such as updating a
loop index variable.

Elimination of these unnecessary side effects can make code more readable,
more maintainable, and as set forth hereinafter in accordance with the
invention, faster to execute.

Many of these same issues motivate Halstead's Multi-Lisp, a parallel
variant of Scheme as disclosed in "MultiLisp: A Language for Concurrent
Symbolic Computation" ACM Transactions on Programming Languages and
Systems, (1985). The approach Multi-Lisp takes to providing access to
parallel computation is the addition of programmer visible primitives to
the language. The three primitives which distinguish it from conventional
Lisp are:

The future primitive which allows an encapsulated value to be evaluated,
while simultaneously returning to its caller a promise for that value. The
caller can continue to compute with this returned object, incorporating it
into data structures, and passing it to other functions. Only when the
value is examined is it necessary for the parallel computation of its value
to complete;

the pcall primitive which allows the parallel evaluation of arguments to a
function. Halstead teaches that pcall can be implemented as a simple macro
which expands into a sequence of futures. It can be thought of as syntactic
sugar for a stylized use of futures; and

the delay primitive which allows a programmer to specify that a particular
computation be delayed until the result is needed. Similar in some respects
to a future, delay returns a promise to compute the value, but does not
begin computation of a value until the result is needed. Thus the delay
primitive is not a source of parallelism in the language. It is a way of
providing lazy evaluation semantics to the programmer.
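
The behaviour of the three primitives described above can be approximated in Python as a minimal sketch. It is only an analogy to Multi-Lisp, not Halstead's implementation: the names `future`, `Delay` and `pcall`, the use of thunks (zero-argument functions) and the thread pool size are assumptions made for illustration, and the `Delay` class is not thread-safe.

```python
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)


def future(thunk):
    """Start evaluating `thunk` immediately on another thread and return a
    promise; the caller blocks only when it examines the value (.result())."""
    return _pool.submit(thunk)


class Delay:
    """Promise whose computation starts only when the value is first
    examined, giving lazy evaluation and no parallelism."""

    def __init__(self, thunk):
        self._thunk = thunk
        self._done = False
        self._value = None

    def force(self):
        if not self._done:
            self._value = self._thunk()
            self._done = True
            self._thunk = None     # the computation is never repeated
        return self._value


def pcall(fn, *arg_thunks):
    """Evaluate the arguments as futures, then apply `fn` to their values."""
    promises = [future(t) for t in arg_thunks]
    return fn(*(p.result() for p in promises))


total = pcall(lambda a, b: a + b, lambda: 2, lambda: 3)
lazy = Delay(lambda: 6 * 7)
answer = lazy.force()
```

As in the text, `pcall` is just a stylized use of `future`, while `Delay` introduces no parallelism and only postpones the computation until `force` is called.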
Both delay and future result in an order of execution different from
applicative order computation. In the absence of side effects, both will
result in the same value as the equivalent program without the primitives,
since they affect only the order in which the computation is performed.
This is not strictly true for delay since its careful use can allow
otherwise non-terminating programs to return a value.

In the presence of side effects, it is difficult to predict the behavior of
a future, since its value may be computed in parallel with other
computations.

While the value of a delay is deterministic, since it does not introduce
additional parallelism, its time of computation is dependent on when its
value is first examined. This can be very non-intuitive and difficult to
think about while writing programs.

Halstead implements the future and delay primitives by returning to the
caller an object of data-type future. This object contains a pointer to a
slot, which will eventually contain the value returned by the promised
computation. The future may be stored in data structures and passed as a
value. Computations which attempt to reference the value of the future
prior to its computation are suspended. When the promised computation
completes, the returned value is stored in the specified slot, and the
future becomes determined. This allows any pending suspended procedures
waiting for this value to run, and any further references to the future
will simply return the now computed value.

The use of future or delay can be thought of as a declaration. Their use
declares that either the computation done within the scope of the delay or
future has no side effects, or that, if it does, the order in which those
side effects are done relative to other computations is irrelevant. There
is also a guarantee with such a declaration that no free or shared variable
referenced by the computation is side effected by some other
parallel-executing portion of the program.

Like all declarations, the use of future or delay is a very strong
assertion. In a way similar to type declarations, their use is difficult to
check automatically, is error prone, has a significant performance impact
if omitted, and may function correctly for many test cases, but fail
unexpectedly on others.

Advocates of strong type checking in compiled languages have attempted for
years to build compilers capable of proving the type correctness of
programs. It is believed that they have failed. All languages sophisticated
enough for serious programming require at least some dynamic checks for
type safety at execution time. These checks are implemented as additional
instructions on conventional architectures, or as part of the normal
instruction execution sequence on more recent architectures as discussed in
Moon, "The Architecture of the Symbolics 3600" The 12th Annual
International Symposium on Computer Architecture pp. 76-83 (1985). One
alternative, declaring the types and relying on the word of the programmer,
while leading to good performance on conventional architectures, is
dangerous, error-prone, and inappropriate for sophisticated modern
programming environments.

SUMMARY OF THE INVENTION

The main object of the present invention is to eliminate the disadvantages
of the prior art systems.

The present invention relates to a computer system having the ability to
execute several portions of a program in parallel, while giving the
appearance to the programmer that the program is being executed
sequentially. This automatic extraction of parallelism makes the
architecture attractive because one can exploit the existing programming
tools, languages and algorithms which have been developed over the past
several decades of computer science research, while gradually endeavoring
to improve them for the parallel environment. While not all of the benefits
of parallel execution will be attained with this architecture, the ability
to extract even modest amounts of parallel execution from existing programs
may be worth the effort. Moreover, the architecture can be extended to
provide some explicit and modular programmer control over the parallelism.

In achieving the invention, just as hardware has provided important runtime
support for type checking, it is proposed to use hardware in the runtime
checking for future declarations.

In a proposed Lisp implementation, explicit language extensions such as
future and pcall are not used. Instead, it is assumed that each evaluation
is encapsulated in a future and that each call can freely evaluate its
arguments in parallel. The approach is to detect and correct those cases in
which this assumption is unjustified.

In accordance with the invention a compiler transforms the program into
sequences of compiled primitive instructions, called transaction blocks.
Each transaction block has a mostly functional portion and terminates in a
side effect. While the mostly functional portion preferably has no side
effecting instructions, the system will operate effectively as long as at
least the majority of the instructions are functional, that is, only
reference but do not modify main memory.

Each block is independent of previous execution, except for the contents of
main memory. In particular, no registers or other internal bits in the
processor are shared across blocks.

Each block contains exactly one side-effecting store into main memory,
occurring at the end of the block.

Each transaction block is, from the standpoint of the memory system, a
mostly functional program consisting of register loads and
register-to-register instructions, intermediate side effecting stores and
ending with a side effecting store into main memory. The results of the
side effecting stores are temporarily stored, before being applied to main
memory.

Each of these blocks can be executed independently, on several different
processors. Moreover, there is no interaction between blocks until the
blocks are confirmed, that is, the results are validated. However, the
order in which the blocks are confirmed is critical. The side effects must
be executed in a well defined order: the order specified in the program.
Further, the execution of a side effect can potentially modify a location
which some other block has already read.

These two tasks, confirming in order and enforcing the dependencies of
later blocks on earlier side effects, require special hardware in
accordance with the invention.

Each processor is free to execute its block at will. The results of the
side effects are stored in a cache associated with the processor. As a
result, this execution cannot affect other processors. A block counter is
maintained, similar in purpose to a normal program counter. The block
counter's job is to keep track of which block is next allowed to be
confirmed. As the block counter reaches each block in turn, the block
compares its referenced locations with the modified locations of already
confirmed blocks. If there is no match, this block is confirmed. Confirming
a block allows the results stored in the cache to be sent to main memory
and modify the contents of main memory. Other blocks, not yet confirmed,
may already have referenced the location which has just been modified. If
they have, then the data upon which they are computing is now invalid when
they are confirmed.

But the program which they are executing has not modified main memory since
the results are stored in a cache. Thus the system is free to abandon the
unconfirmed executed block, and to attempt re-executing it a second time.
This abandoning of an unconfirmed executed block is called aborting the
block.

In order to detect when the block needs to be aborted, one needs to keep
track of which memory locations it has referenced. A dependency list is
therefore maintained, containing the addresses of all locations in main
memory which have been referenced during execution of a block. When a
processor executes a load instruction, it adds the address being loaded to
the executing block's dependency list.

When a block is confirmed, the address it is side effecting is broadcast to
all other executing processors. Each processor checks its dependency list
to determine if the block it is executing has referenced the modified
location. If it has, the block is aborted and re-executed.

Eventually, a given block will be reached by the block counter, and it will
be allowed to complete and perform its side-effect. The block counter
advances to the next block, and if the block is ready, that block is
confirmed in the following cycle. If the confirmation of one block aborts
the execution of successor blocks, however, the performance will be limited
by the time it takes the aborted block to re-execute its program.

The performance improvement resulting from this technique depends on two
factors: the average block length, and the frequency of aborts. The longer
the block, the fewer confirms are necessary to execute a given program. The
more frequent the aborts, the more the block counter must await a
sequential computation before confirming the next block.

One can see now the influence of reducing the number of side effects in the
source program: as the number of side effects is reduced, the length of the
blocks can be increased. Since the blocks are longer, fewer blocks need be
executed to perform a computation. Since the block execution is sequential,
at a rate limited to a maximum of one per clock cycle, this limits the
performance of the architecture. There is probably a maximum desirable
block length, since as the block size increases, the length of the
dependency list, and thus the likelihood of a side effect from another
block influencing the computation, goes up.

In accordance with the invention, the parallel processing system of the
invention is receptive of a program and has at least two processors
connected in parallel to a shared main memory. Each processor executes
instructions of the program which includes side effecting instructions
which modify the contents of a location in main memory and functional
instructions which reference locations in main memory. Means are provided
for compiling a program into a series of independent instruction blocks
including predominantly functional instructions and terminating in a
side-effecting instruction and wherein the processors execute the blocks in
parallel.

The system accordingly further comprises for each processor, means for
maintaining a dependency list of all locations in main memory which have
been referenced during the execution of a block therein, first cache means
associated with each processor for temporarily storing the location and
contents of each location in main memory to be side effected by the
execution of the block in the associated processor and means are provided
for confirming each block in the order of the blocks as specified in the
program wherein the validity of the contents of the first cache means are
approved. The confirming means comprise means for comparing the dependency
list of the block to be confirmed with the locations side effected by
already confirmed blocks to detect a match and means for aborting the
execution of the block to be confirmed if there is a match.

In a preferred embodiment, the confirming means comprises a block counter
and means for incrementing the block counter to indicate the next block to
be confirmed. The means for maintaining the dependency list comprises
second cache means associated with each processor for temporarily storing
each location and contents of main memory referenced by a block being
executed in the processor. The confirming means further comprises means for
updating the contents of the second cache means upon the aborting of a
block and for effecting the reexecution of an aborted block with the side
effected location from an already confirmed block.

In one embodiment the first and second cache means each comprises a fully
associative content addressable memory. In the alternative embodiment, the
second cache means comprises an addressable random access memory for
storing the dependency information of main memory locations and hash table
means receptive of said addresses of the main memory locations for hashing
them into a reduced number of address bits for addressing the random access
memory.

In accordance with the present invention, the method for executing a
program in at least two processors connected in parallel to a shared main
memory, comprises compiling the program into a series of independent
instruction blocks including predominantly functional instructions and
terminating in a side-effecting instruction, applying the blocks to the
processors and executing the blocks in parallel.

The step of executing comprises maintaining a dependency list for each
processor of all locations in main memory which have been referenced during
the execution of a block therein, for each processor temporarily storing
the location and contents of each location in main memory to be side
effected by the execution of the block therein and confirming each block in
the order of the blocks as specified in the program to approve the validity
of the temporarily stored contents by comparing the dependency list of the
block to be confirmed with the locations which have been side effected by
the already confirmed blocks to detect a match.

When there is a match the execution of the block is aborted.

The step of maintaining a dependency list preferably comprises temporarily
storing, for each processor, each location and contents of main memory
referenced by the block being executed therein and wherein the step of
executing further comprises reexecuting an aborted block with the side
effected location from an already confirmed block.
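
The confirm and abort scheme summarized above can be sketched as a small sequential simulation in Python. It is a software analogy under stated assumptions, not the disclosed hardware: the name `run_blocks`, the representation of a block as a function returning one `(address, value)` store, and the dictionary standing in for main memory are all hypothetical, and the speculative executions that real processors would perform in parallel are simply run one after another.

```python
def run_blocks(blocks, memory):
    """Confirm blocks in program order, aborting any later block whose
    dependency list contains an address that has just been side effected.

    blocks: list of callables block(load) -> (address, value)  (assumed shape)
    memory: dict standing in for main memory
    Returns the number of aborts that occurred.
    """
    def execute(block):
        deps = set()                       # dependency list of this execution

        def load(addr):
            deps.add(addr)                 # every load is recorded
            return memory[addr]

        store = block(load)                # held back, as in a confirm cache
        return deps, store

    # Speculative execution of every block against the current memory.
    pending = [execute(b) for b in blocks]
    aborts = 0
    for k in range(len(blocks)):           # k plays the role of the block counter
        _, (addr, value) = pending[k]
        memory[addr] = value               # confirm: the store reaches memory
        for j in range(k + 1, len(blocks)):
            if addr in pending[j][0]:      # broadcast address hits a dependency
                pending[j] = execute(blocks[j])   # abort and re-execute
                aborts += 1
    return aborts


memory = {"A": 1, "B": 2, "C": 3, "D": 0}
blocks = [
    lambda load: ("A", load("C") + load("B")),   # block 1: A = C + B
    lambda load: ("D", load("B") + load("C")),   # block 2: D = B + C
    lambda load: ("A", load("A") + load("B")),   # block 3: A = A + B
]
aborts = run_blocks(blocks, memory)
```

In this illustrative run the third block first reads the old value of A, is aborted when the first block confirms its store to A, and is re-executed, so the final memory matches what sequential execution would have produced.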
In another embodiment, the step of temporarily stor- memory is stored in the dependency cache 3a-3c 

ing the referenced locations comprises hashing each whereas each side effecting instruction is carried out 

location address into a hashed address having a reduced and written only into the confirm cache 2a-2c. 

number of address bits and storing a logic one at the Thus each processor is free to execute its block at will 

hashed address in a random access memory. since all referenced locations and all locations to be 

Moreover, the method includes selectively omitting modified are stored in the caches 2a-2c and 3a-3c with- 

referenced locations from the dependency list when out the contents of main memory 6, 7 being affected. 

desired. In order to validate the data for being written into 

These and other features of the invention will be main memory to modify the contents thereof, each 

described in the following description with reference to block must first be confirmed. The blocks are confirmed 

the drawings, wherein: in the order specified in the overall program and this is 

BRIEF DESCRIPTION OF THE DRAWINGS carried out by the block counter 8 which specifies the 

order in which the processors 1a-1c will be confirmed. 

FIG. 1 is a block diagram of a system according to The dependency caches 3a-3c, by keeping track of all 

the invention; memory locations referenced by the processor during 

FIG. 2 shows an example of a transaction block ac- the execution of a block, thereby maintains a depen- 

cording to the invention; dency list of the data which is depended on by the block 

FIG. 3 shows an example of several blocks executing in order to achieve the execution thereof. After the first 

in parallel; block is confirmed, all of the side effected locations 

FIG. 4 is a call tree for guiding the block starting stored in the confirm cache corresponding thereto is 

process according to the invention; placed out on the bus and received by the dependency 

FIG. 5 is a state transition diagram of the state transi- caches of the other blocks to see if any of these side 

tion logic for the dependency cache; effected locations were depended upon by the blocks 

FIG. 6 is an alternate embodiment for the depen- during execution. If this is the case and there is a match, 

dency cache shown in FIG. 1. then the executed unconfirmed block must be aborted 

DETAILED DESCRIPTION OF THE and reexecuted with the updated side effected location. 

In a preferred embodiment of the present invention, 
Referring now to FIG. 1, the system in accordance the confirm and dependency caches are fully associative 
with the present invention includes processors 1a, 1b, 1c content addressable memories which enable the com- 
connected in parallel to common bus 5 to share main parison to be made without any additional structure. 

memories 6 and 7. It should be noted that while three FIG. 3 shows an example of several blocks executing 

processors are shown in parallel for the purpose of in parallel. As can be seen from this example, block 3 

example, the system according to the present invention will abort when the first block is confirmed since block 

can utilize at least two processors in parallel up to an 3 references the variable A which has been side effected 

indefinite number in accordance with the method and in block 1. 

system of the present invention. It should also be noted Since the system is simulating the behavior of a uni- 
that the present invention is also capable of operating in processor by using parallel hardware, there is a defined 
a quasi parallel processor of the type wherein a single order in which the blocks must be confirmed. Execution 
processor which executes two or more programs in an of a given block should start early enough so that its 
interleaved manner is also contemplated as being within results are ready by the time that block is reached by the 
the scope of the present invention. block counter 8. 
Connected between each processor and the bus 5 are FIG. 4 shows a call tree of a process which provides 
a confirm cache 2a, 2b, 2c and a dependency cache 3a, heuristic data to guide the block starting process. Here, 
3b, 3c which has associated state transition logic 4a, 4b, it is desirable to start execution of blocks 2 and 3, while 
4c connected thereto. Processors 1a, 1b and 1c also have the block counter is at block 1, because these are the 
connections to a block counter 8, which will be de- next sequential blocks. One would also want to start 
scribed hereinafter. execution of block 6, because it is unlikely that confirm- 
The system also includes compiler 9 connected to the ing blocks 1-5 will side effect data which block 6 de- 
bus in a conventional manner for transforming an input pends upon. 

program into sequences of compiled primitive instruc- But starting execution of a block earlier is potentially 

tions called transaction or instruction blocks. bad, because it will reference data which is older than 

Each block, as shown in FIG. 2, has a mostly func- necessary and because the processor being used to exe- 

tional portion and terminates in a side effecting store cute it may be needed to execute a block which must be 

instruction. As noted, the functional instructions in- 55 confirmed sooner. 

elude those which reference main memory to carry out These observations lead to the following block execu- 

the operations thereof or operate only on internal pro- tion strategy: 

cessor registers, while the side effecting instructions are 1. Start executing blocks ahead of this block, or 

those which require modification of the contents of a downward and to the left in the call tree, 

location in main memory. 60 2. Start executing blocks which branch downward to 

The mostly functional portion can have no side ef- the right in the call tree, 

fecting instructions or one or more side effecting in- 3. If additional processors are required for (1) or 

structions, as long as at least a majority of the instruc- (2), abort execution of the furthest right branches of 

tions thereof are functional. the call tree, and reclaim their processors. 

In accordance with the method of the present inven- 65 Heuristic (1) is very similar to instruction pre-fetch- 

tion, after the compiler 9 has broken up the program ing on conventional pipelined architectures. Since we 

into the blocks, the blocks are applied to the processors are executing this block, we will next need to execute its 

la-lc for execution in parallel. Each reference to main successor. 
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Heuristic- (2) depends on the fact that parallel execu- 
tion of evaluations from the same level of the call tree 
are often independent, and can often be executed with- 
out conflicting side effects. 

Heuristic (3) is a simple consequence of the fact that
we must finish execution of the bottom left portion of 
the call tree prior to executing the right portion of the 
call tree. Thus, we are better off using the processors 
executing blocks to the right for more immediate needs. 

These heuristics give a very good handle on the difficult
problem of resource allocation in the multiprocessor
environment.

The presence of conditionals in the language, however,
makes the tidy resource allocation scheme described above
much more difficult. Since one does not know which of the
two paths a conditional will follow until the data
computation is confirmed, one must start blocks executing
down both branches, or risk delaying execution waiting for
a block on the path not followed. Fortunately, one is free
to start executing down both paths, since until a block is
confirmed it does not side effect main memory. The blocks
started along the unfollowed branch of a conditional are
simply aborted without confirming.

Allocation of processor resources in the presence of
conditionals is substantially more difficult, since one
cannot, except heuristically, predict which of two paths
is more worthwhile for expending processor resources.

Loops are a special case of conditional execution. As
with most parallel processors, there is an advantage in
knowing the high level semantics of a loop construct.
Rather than attempting to execute repetitive copies of
the same block, incrementing the loop index as a setup
for the next pass, it is far preferable to know that a loop
is required, and that the loop is, for example, indexing
over a set of array values from 0 to 100. One can then
execute 101 copies of the loop block, each with a unique
value of the loop index. The unpalatable alternative is to
rely on the side effect detection hardware to notice that
the loop index is changed in each block iteration,
resulting in an abort of the next block on every loop
iteration.

Unlike most of the parallel proposals for loop execution,
however, the systems herein can also handle the hard case
where the execution of the loop body modifies the loop
control parameters, perhaps in a very indirect or
conditional way.

The way the system deals with both conditional execution
and with loop iteration involves prediction of the future
value of a variable. In a sense, the dependency list
maintenance already described is a form of value
prediction: one predicts that the value one is depending
upon will not change by the time the executing block
confirms.

One can extend this idea to one which predicts the future
value of variables likely to be modified from their
current value prior to executing the block. One
application of this technique is to the problem of loop
iterations. Instead of relying on the current value of the
loop index as the predicted value for the next iteration,
one desires to predict the future value, predicting a
different future value for each parallel body block
started.

Similarly, one can implement conditionals by predicting
the value of the predicate result slot, predicting that it
is true in one arm of the conditional, and that it is false
in the other. The predicate is then evaluated, and when
it confirms, it side effects the predicate result slot,
aborting one of the two branches. When aborting the
execution of conditional blocks, we must abandon execution
of the block rather than update the variable and retry, as
we would in normal block execution.
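
The speculative handling of a conditional described above can be sketched roughly as follows. This is only an illustration in Python under stated assumptions: the function `run_conditional` and its arguments are hypothetical names, a dictionary stands in for main memory, a per-arm dictionary stands in for that arm's unconfirmed writes, and the two arms run one after the other here rather than on separate processors.

```python
def run_conditional(memory, predicate, then_arm, else_arm):
    """Run both arms speculatively, then keep only the arm whose
    prediction of the predicate result turned out to be right."""
    speculative = {}
    for predicted, arm in ((True, then_arm), (False, else_arm)):
        writes = {}            # stand-in for this arm's unconfirmed writes
        arm(memory, writes)    # the arm reads memory but writes only locally
        speculative[predicted] = writes

    # The predicate "confirms": its result decides which arm survives.
    outcome = bool(predicate(memory))

    # The surviving arm's writes reach memory; the other arm is abandoned
    # (its buffered writes are simply dropped, never retried).
    memory.update(speculative[outcome])
    return outcome
```

Because neither arm touches `memory` before the predicate is known, starting both arms is safe, which mirrors the observation that an unconfirmed block does not side effect main memory.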

The shared memory multiprocessor of the present
invention, with each processor 1a-1c containing two
fully associative caches, effectively implements the
above scheme. The first of the caches, the dependency
cache 3a-3c, usually holds read data copies from main
memory 6, 7. This cache also watches bus transactions in
a way similar to the snoopy cache coherency protocol
as described in Goodman, "Using Cache Memory to
Reduce Processor Memory Traffic," The 10th Annual
International Symposium on Computer Architecture,
pp. 123-131 (1983). The second cache, the confirm
cache 2a-2c, holds only side effected data written by the
associated processor, but not yet confirmed.

Since the processors share a common memory bus 5, 
each cache can see all writes to main memory. The 
dependency cache acts as a normal data read cache, but 
also implements the dependency list and predicted 
value features. The confirm cache is used to allow pro- 
cessors to locally side effect their version of main mem- 
ory, without those changes being visible to other pro- 
cessors prior to their block confirmation. 

Each main memory location, therefore, can have two
entries in a processor's caches, one in the dependency
cache, and one in the confirm cache. This is because the
dependency cache must record what data the processor's
current computation has depended upon, while the confirm
cache allows the processor to tentatively update its
version of memory. For processor reads, priority is always
given to the contents of the confirm cache if there is an
entry in both caches, because one wants the processor to
see its own modifications to memory before they are
confirmed.
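
A minimal sketch of this read-priority rule, assuming Python dictionaries stand in for the two caches and for main memory; the class `ProcessorCaches` and its method names are hypothetical and are not taken from the specification:

```python
class ProcessorCaches:
    """Toy model of one processor's confirm and dependency caches."""

    def __init__(self, main_memory):
        self.main_memory = main_memory  # shared dict: address -> value
        self.dependency = {}            # read copies the block depends on
        self.confirm = {}               # tentative, unconfirmed writes

    def read(self, addr):
        # The confirm cache has priority, so the block sees its own writes.
        if addr in self.confirm:
            return self.confirm[addr]
        # Otherwise read through the dependency cache, recording the
        # location as something the block has depended upon.
        if addr not in self.dependency:
            self.dependency[addr] = self.main_memory[addr]
        return self.dependency[addr]

    def write(self, addr, value):
        # Side effects stay local until the block is confirmed.
        self.confirm[addr] = value
```

In this sketch a write never reaches `main_memory`, so other processors keep seeing the old value until a separate confirm step copies the confirm cache out.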

The dependency cache performs three functions: 

1. It acts as a normal read data cache for main mem- 
ory data. 

2. It stores block dependency information. 

3. It holds predicted values of variables associated 
with this block. 

FIG. 5 shows a state transition diagram for the dependency
cache as carried out by the state transition logic 4a, 4b,
4c. There are six states shown in the cache state diagram:
INVALID, VALID, DEPENDS, PREDICT/VALID,
PREDICT/INVALID and PREDICT/ABANDON.

INVALID is the state associated with an empty 
cache line. 

VALID is the state representing that the cache holds
a correct copy of a main memory datum.

DEPENDS indicates that the cache holds a correct datum,
and that the ongoing block computation is depending on the
continued correctness of this value.

PREDICT/VALID indicates that the block is pre- 
dicting that, when it is ready to confirm, the value held 
in this cache location will continue to be equal to the 
value in main memory. This state indicates that the held 
value is now equal to main memory. This state differs 
from DEPENDS only in the action taken when a bus 
cycle modifies this memory location. 

PREDICT/INVALID indicates that the block is 
predicting that, when it is ready to confirm, the value 
held in this cache location will be equal to the value in 
main memory. The contents of main memory currently 
differ from the value held in the cache.

PREDICT/ABANDON indicates that the block execution is
conditional on the eventual contents of memory being side
effected to be equal to that held in the cache. If it is
side effected to some other value, the block will be
aborted and not restarted (abandoned).

At the start of each block execution, the dependency
cache is initialized. This results in setting all
locations to either the INVALID or VALID states. Cache
entries which are currently VALID, DEPENDS, or
PREDICT/VALID are forced to the VALID state. All others
are forced to the INVALID state.

Each time the processor loads a data item into a register
from the dependency cache, the state of the cache line is
potentially modified. If the cache entry for the location
is in the INVALID state, a memory bus read request is
performed, a new cache line is allocated, the data is
stored in the dependency cache, and the line state is set
to DEPENDS. If the cache line is already VALID, then the
state is simply set to DEPENDS. In all other states, the
data is provided from the cache, and the state of the
entry is unmodified.

A cache miss is allowed to replace any entry in the cache
which is in the INVALID or VALID states, but must not
modify other entries. We require this because the
dependencies are being maintained by these states. This
implies that the cache must be fully associative, since
otherwise a conflict over the limited number of sets
available for a given location would make it impossible to
store the dependency information.

A Predict-Value request from the processor stores a
predicted future value for a memory location into the
dependency cache. The cache initiates a main memory read
cycle for the same location. If the location contains the
predicted value, the cache state is set to PREDICT/VALID.
If the values disagree, the cache state is set to
PREDICT/INVALID.

A Predict-Abandon request from the processor stores a
predicted future value of a predicate into the dependency
cache, and forces the cache line to the PREDICT/ABANDON
state.

A bus write cycle, taken by some other processor, can
potentially modify a location in main memory which is held
in the dependency cache. Since there is a shared memory
bus connecting all of the caches, each cache can monitor
all of these writes. For INVALID cache entries, nothing is
done. For VALID entries, the newly written data is copied
into the cache, and the cache maintains validity.

For DEPENDS entries, the cache is updated, and if the new
contents differ from the old contents, the running block
on the cache's associated processor is aborted and then
restarted. This is the key mechanism for enforcing
inter-block sequential consistency.

Bus writes to locations which are PREDICT/VALID or
PREDICT/INVALID are compared with the contents of the
dependency cache. If the values agree, the cache state is
set to PREDICT/VALID. If they disagree, the state is set
to PREDICT/INVALID.

Bus writes to locations which are PREDICT/ABANDON are
compared with the contents of the dependency cache. If
they agree, no change occurs. If they disagree, the
associated processor is aborted, and the
block currently executing is permanently abandoned.
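
The transitions described for FIG. 5 can be collected into a small software model. The following is a sketch only, assuming a Python dictionary stands in for main memory and for the fully associative storage; the class `DependencyCache`, its method names, and the string return values "restart" and "abandon" are hypothetical conveniences, not part of the described hardware.

```python
from enum import Enum


class State(Enum):
    INVALID = "INVALID"
    VALID = "VALID"
    DEPENDS = "DEPENDS"
    PREDICT_VALID = "PREDICT/VALID"
    PREDICT_INVALID = "PREDICT/INVALID"
    PREDICT_ABANDON = "PREDICT/ABANDON"


class DependencyCache:
    def __init__(self, memory):
        self.memory = memory  # shared dict standing in for main memory
        self.lines = {}       # addr -> [state, value]

    def start_block(self):
        # VALID, DEPENDS and PREDICT/VALID become VALID; the rest INVALID.
        keep = {State.VALID, State.DEPENDS, State.PREDICT_VALID}
        for line in self.lines.values():
            line[0] = State.VALID if line[0] in keep else State.INVALID

    def load(self, addr):
        line = self.lines.get(addr)
        if line is None or line[0] is State.INVALID:
            # Miss: read main memory and start depending on the value.
            line = self.lines[addr] = [State.DEPENDS, self.memory[addr]]
        elif line[0] is State.VALID:
            line[0] = State.DEPENDS
        return line[1]        # other states: data supplied, state unchanged

    def predict_value(self, addr, value):
        agrees = self.memory[addr] == value
        state = State.PREDICT_VALID if agrees else State.PREDICT_INVALID
        self.lines[addr] = [state, value]

    def predict_abandon(self, addr, value):
        self.lines[addr] = [State.PREDICT_ABANDON, value]

    def snoop_write(self, addr, value):
        """React to another processor's bus write.
        Returns None, "restart", or "abandon"."""
        line = self.lines.get(addr)
        if line is None or line[0] is State.INVALID:
            return None
        state, old = line
        if state is State.VALID:
            line[1] = value
        elif state is State.DEPENDS:
            line[1] = value
            if value != old:
                return "restart"
        elif state in (State.PREDICT_VALID, State.PREDICT_INVALID):
            agrees = value == old
            line[0] = State.PREDICT_VALID if agrees else State.PREDICT_INVALID
        elif state is State.PREDICT_ABANDON and value != old:
            return "abandon"
        return None
```

The caller is assumed to update `memory` itself and to call `start_block` again when a "restart" is reported; the line replacement policy on a miss is not modeled.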

The confirm cache preferably comprises a fully
associative cache which holds only side-effected data.
When the block is initialized, the confirm cache is
emptied by invalidating all its entries. When the
processor performs a side-effecting write, the write data
is held in the confirm cache, but not written immediately
into main memory. There, it is visible to the processor
which performed the side-effect, but to no other
processors. The confirm cache has priority over the
dependency cache in providing read data to the processor.
If both hold the contents of a location, the data is
provided by the confirm cache. This allows a processor to
modify a location, and to have those modifications visible
during further computation within the block.

When the block counter 8 reaches a block associated with
a particular processor, and the block has completed
execution, it is time to confirm that block. One final
memory consistency check is performed. The load
dependencies are necessarily satisfied, because if they
were not, then the block would have been aborted. But the
predicted-value dependencies may not be satisfied. These
dependencies are checked by testing to see if there are
any entries in the dependency cache which are in the
PREDICT/INVALID state. Any entry in that state indicates a
memory location whose contents were predicted to be one
value, and whose actual memory contents are another. This
block must be re-executed with the now-current value.

After performing this final consistency check, the 
side effects associated with this block may be per- 
formed. This operation consists of sweeping the con- 
firm cache, and writing back to memory any modified 
entries. The write-back of these entries may, of course, 
force other processors to abort partially or completely 
executed blocks, if they have depended on the old val- 
ues of these locations. 
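
The confirm step can be sketched as a single function. This is an illustrative Python model under stated assumptions: `confirm_block` is a hypothetical name, dependency cache states are given as plain strings, and dictionaries stand in for the confirm cache and main memory.

```python
def confirm_block(dependency_states, confirm_cache, main_memory):
    """Try to confirm a block that has finished executing.

    dependency_states: addr -> state name of each dependency cache line
    confirm_cache:     addr -> tentatively written value
    Returns the list of (addr, value) writes placed on the bus, or None
    if the block must be re-executed instead.
    """
    # Final consistency check: a mispredicted value blocks confirmation.
    if any(s == "PREDICT/INVALID" for s in dependency_states.values()):
        return None

    # Sweep the confirm cache and write every modified entry back.
    writes = list(confirm_cache.items())
    for addr, value in writes:
        main_memory[addr] = value
    return writes
```

The returned writes are what other processors' dependency caches would observe on the bus, and they may in turn force those processors to abort their own blocks.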

One consequence of using this cache strategy for 
implementing the dependency checking is that each 
block can perform multiple side-effects, and those side- 
effects will be visible only to the executing block until 
the block is confirmed. This enables each block to have 
more than just the terminal side-effect and allows the 
compiler flexibility in choosing the optimal block size 
independent of how many side-effects are present 
within the block. 

Another important feature of using the cache as a 
dependency tracking mechanism is that when a transac- 
tion aborts, the valid entries in the cache remain, so that 
the re-execution of the block will likely achieve an 
almost 100% cache hit rate, reducing memory bus traf- 
fic and improving processor speed. 

The Lisp operation cons, although it performs a write 
into main memory, is not a side effecting instruction, 
since it is guaranteed that there is no other pointer to the 
location being written, and thus no other block can be 
affected by its change. In the block execution architec- 
ture, this can be implemented by providing each proces- 
sor with an independent free pointer. Consing and other 
local memory allocating writes done within a block are 
performed using the free pointer. The value of the 
pointer is saved when the block is entered, and abort 
resets the pointer to its entry value, automatically
reclaiming the allocated storage. During the confirm, the
free pointer's value is updated. All of these free pointer 
manipulations happen automatically if it is treated sim- 
ply as another variable whose value can be loaded and 
side effected. Writes of data into newly consed locations 
need not be confirmed. They can be best handled with 
a write-through technique in the dependency cache. It 
is important that the dependency cache not contain stale 
copies of data which has been written with a cons write 
operation. 
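
A rough sketch of the per-processor free pointer, assuming a Python list stands in for the region of memory a processor allocates from. The class `BlockAllocator` and its methods are hypothetical; in particular, the explicit `enter_block`, `abort` and `confirm` calls are a simplification of treating the free pointer as an ordinary variable handled by the caches.

```python
class BlockAllocator:
    """Bump allocator whose free pointer is rolled back on abort."""

    def __init__(self, heap_size):
        self.heap = [None] * heap_size
        self.free = 0      # current free pointer
        self._entry = 0    # free pointer value saved at block entry

    def enter_block(self):
        self._entry = self.free

    def cons(self, car, cdr):
        # Writes go to fresh storage that no other block can reference.
        addr = self.free
        self.heap[addr] = (car, cdr)
        self.free += 1
        return addr

    def abort(self):
        # Resetting the pointer reclaims everything the block allocated.
        self.free = self._entry

    def confirm(self):
        # The advanced free pointer becomes the new baseline.
        self._entry = self.free
```

An aborted block leaves stale cells above the free pointer, but they are unreachable and are overwritten by later allocations.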

The addition of one primitive to the processor
instruction set allows the re-introduction of explicit
programmer visible parallelism into the language supported
by this architecture. The primitive performs a load
operation without adding the location being loaded to the
dependency list. In the dependency cache, it returns Valid
data if present, and reads memory if necessary to produce
valid data. It never changes the state of the cache to
Depends.

With this primitive, it is possible to express algorithms
which compute results based on potentially stale data. In
those cases where this is acceptable from an algorithmic
point of view, then parallel execution with no inter-block
dependencies can be supported. One example is the
Gauss-Jacobi parallel iterative matrix solution technique,
as contrasted with the sequential
Gauss-Seidel technique.
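
The contrast between the two techniques can be illustrated with a short sketch in plain Python. The function names are hypothetical, and the code runs sequentially; it only shows which values each update reads. A Gauss-Jacobi step reads only the previous iterate, so every row update is independent and tolerates data that is one iteration stale, while a Gauss-Seidel step reads values updated earlier in the same sweep and is therefore inherently ordered.

```python
def jacobi_step(A, b, x):
    # Every component is computed from the old vector x only.
    n = len(b)
    return [
        (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]


def gauss_seidel_step(A, b, x):
    # Each component uses the components already updated in this sweep.
    n = len(b)
    x = list(x)
    for i in range(n):
        s = sum(A[i][j] * x[j] for j in range(n) if j != i)
        x[i] = (b[i] - s) / A[i][i]
    return x


def solve(step, A, b, iterations=50):
    x = [0.0] * len(b)
    for _ in range(iterations):
        x = step(A, b, x)
    return x
```

For a diagonally dominant system such as `A = [[4.0, 1.0], [1.0, 3.0]]`, `b = [6.0, 7.0]`, both `solve(jacobi_step, A, b)` and `solve(gauss_seidel_step, A, b)` approach the solution `[1.0, 2.0]`.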

In an alternative hardware implementation shown in
FIG. 6, dependencies are stored using a hashing technique
rather than a fully associative cache. This implementation
may be attractive because the RAM used is less expensive
than the fully associative storage used in the cache based
design.

The circuit includes processors 11a, 11b, 11c connected
through the confirm caches 12a, 12b, 12c to the shared
system bus 16 with memories 15a, 15b. The circuit also
includes a hash function box 13a, 13b, 13c which receives
the main memory address, which is for example thirty-two
bits, and hashes it into a reproducible hashed address of,
for example, twelve bits. The twelve bit address is used
to address a 4096 x 1 RAM 14a, 14b, 14c. At the start of
an instruction block execution in processor 11a, the Clear
signal is asserted, and the RAM 14a is cleared, either by
writing each location to a logic zero, or through special
means for simultaneously clearing all locations. Each time
processor 11a performs a memory load operation it asserts
the write signal, and writes a logic one into RAM 14a at
the location addressed by the hash function. Each time
processor 11b terminates a block by performing its last
side effect, the memory store from processor 11b to memory
15a or 15b is visible to processor 11a. Hash box 13a
calculates the hashed address, and examines the bit in RAM
14a corresponding to the hashed address being written by
processor 11b. If this bit is a logic one, then the block
just confirmed on processor 11b has potentially (but not
necessarily) invalidated the computation being performed
by processor 11a. Processor 11a must abort, clear RAM
14a, and retry executing the block.
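
The hashed scheme behaves like a one-bit-per-slot filter, which can be sketched in Python as follows. The class `HashedDependencyFilter` is a hypothetical name, and the XOR-folding hash is an assumption made for illustration; the text above only requires that the hash be reproducible.

```python
class HashedDependencyFilter:
    """Model of a 2**bits x 1 RAM indexed by a hash of the address."""

    def __init__(self, bits=12):
        self.bits = bits
        self.size = 1 << bits          # 4096 entries for 12 bits
        self.ram = [0] * self.size

    def _hash(self, addr):
        # Assumed hash: XOR-fold a 32-bit address into `bits` bits.
        addr &= 0xFFFFFFFF
        h = 0
        while addr:
            h ^= addr & (self.size - 1)
            addr >>= self.bits
        return h

    def clear(self):
        # Done at the start of each block execution and after an abort.
        self.ram = [0] * self.size

    def record_load(self, addr):
        # Each memory load sets the bit for the hashed address.
        self.ram[self._hash(addr)] = 1

    def must_abort(self, written_addr):
        # A set bit means the write may have touched a location this
        # block loaded; distinct addresses can collide, so the answer
        # is conservative.
        return self.ram[self._hash(written_addr)] == 1
```

With this particular hash, addresses `0x1234` and `0x0235` map to the same slot, so a write to one forces an abort of a block that only loaded the other. Such false positives cost an unnecessary retry but never miss a true dependency.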

What is claimed is: 

1. In a parallel processing system receptive of a program
and having at least two processors connected in parallel
to a shared main memory, wherein each processor executes
instructions of a program which includes side effecting
instructions which modify the contents of a location in
main memory and functional instructions which reference
locations in main memory or operate only on internal
processor registers, the improvement comprising: means for
compiling a program into a series of independent
instruction blocks including predominantly functional
instructions and terminating in a side effecting
instruction; means for applying compiled blocks to the at
least two processors for executing the blocks in parallel;
and means for validating data to be stored in a location
of main memory from each parallel processed block relative
to locations referenced during execution by the other of
the at least two processors, wherein the validating means
comprises, for each processor, means for maintaining a
dependency list of all locations in main memory which have
been referenced during the execution of a block therein.



2. The system according to claim 1, wherein the vali- 
dating means comprises first cache means associated 
with each processor for temporarily storing the location 
and contents of each location in main memory to be side 
effected by the execution of the block therein and means 
for confirming each block in the order of the blocks as 
specified in the program wherein the validity of the 
contents of the first cache means are approved compris- 
ing means for comparing the dependency list of the 
block to be confirmed with the locations side effected 
by already confirmed blocks to detect a match and 
means for aborting the execution of the block to be 
confirmed if there is a match. 

3. The system according to claim 2, wherein the con- 
firming means comprises a block counter and means for 
incrementing the block counter to indicate the next 
block to be confirmed. 

4. The system according to claim 2, wherein the 
means for maintaining a dependency list comprises sec- 
ond cache means associated with each processor for 
temporarily storing each location and contents of main 
memory referenced by a block being executed in the 
processor and wherein the confirming means comprises 
means for updating the contents of the second cache 
means upon the aborting of a block and for effecting the 
reexecution of an aborted block with the side effected 
location from an already confirmed block. 

5. The system according to claim 4, wherein the first 
and second cache means each comprises a fully associa- 
tive content addressable memory. 

6. The system according to claim 4, wherein the sec- 
ond cache means comprises an addressable random 
access memory for storing dependency information of 
main memory locations and hash table means receptive 
of said addresses of the main memory locations for 
hashing them into a reduced number of address bits for
addressing the random access memory. 

7. The system according to claim 1, wherein the 
means for maintaining a dependency list includes means 
for selectively referencing locations in main memory 
without adding same to the dependency list. 

8. In a method for executing a program in at least two 
processors connected in parallel to a shared main mem- 
ory, wherein the program includes side effecting in- 
structions which modify the contents of a location in 
main memory and functional instructions which refer- 
ence locations in main memory, the improvement com- 
prising: compiling the program into a series of indepen- 
dent instruction blocks including predominantly func- 
tional instructions and terminating in a side effecting 
instruction, applying compiled blocks to the processors, 
executing the blocks in parallel and validating data to be 
stored in a location of main memory from each parallel 
processed block relative to locations referenced during 
execution by the other of the at least two processors by 
maintaining a dependency list for each processor of all 
locations in main memory which have been referenced 
during the execution of a block therein. 

9. The method according to claim 8, wherein the step 
of validating further comprises for each processor tem- 
porarily storing the location and contents of each loca- 
tion in main memory to be side effected by the execu- 
tion of the block therein and confirming each block in 
the order of the blocks as specified in the program to 
approve the validity of the stored contents by compar- 
ing the dependency list of the block to be confirmed 
with the locations which have been side effected by the 
already confirmed blocks to detect a match. 



